Congratulations on downloading Machine Learning for absolute beginners: 
step-by-step guide to learning and mastering machine learning for absolute 
beginners. 


The objective of this book is to give absolute beginners an in-depth
understanding of machine learning methods. It explains the fundamentals of
machine learning and helps the reader acquire and develop a basic
understanding of machine learning techniques.


Machine learning is now used in many different fields. This book will help
you understand the concepts of machine learning, the different types of
machine learning, and when each is applied. The book has four main
chapters. The first chapter presents the big picture of machine learning
concepts. The second chapter presents the different machine learning
paradigms. The third chapter gives an in-depth explanation of artificial
neural networks. The fourth chapter presents the algorithms used to train
machine learning models as well as artificial neural networks.


In the first chapter of this book, we introduce the concept of machine
learning and explain the philosophy behind it. We also explain when machine
learning can be applied and when it is better to use an alternative. We
present the advantages and the challenges of implementing a machine
learning model to solve a specific problem, and we list some examples of
machine learning applications. Finally, we briefly present how a machine
learning model can be implemented.


In chapter two, we go through the different types of machine learning,
namely supervised learning, unsupervised learning, semi-supervised
learning and reinforcement learning. For each type, we explain the
applications for which it can be used, as well as its advantages and
challenges.


Chapter three provides the details of artificial neural networks. It first
explains the fundamentals and the concept of artificial neural networks. It
then presents the components of an artificial neural network, which include
neurons and activation functions. In this chapter we explain how neurons
work. We also present the different activation functions that can be used
in artificial neural networks, the differences between them, and when each
activation function is best used.


In this chapter we also go through the major types of artificial neural
networks and when each type is used. We explain how to develop an
artificial neural network, the general rules that should be considered in
doing so, and the loss functions that can be used. We also explain how to
split and use the data to train an artificial neural network, as well as
the concepts behind feed-forward and backpropagation. Finally, we briefly
explain the procedure to train an artificial neural network.


Chapter four tackles learning algorithms. We present the gradient descent
algorithm as well as its variants, namely the stochastic, batch and
mini-batch gradient descent algorithms. We also present the Adam algorithm,
a more recent optimization algorithm that is widely used for training
artificial neural networks.


Machine learning is a powerful tool for converting data into valuable
information that serves a particular purpose. In this book, we will learn
how to use this tool to develop powerful models, along with the different
techniques that can be used.


There are plenty of books on this subject on the market, so thanks again
for choosing this one! Every effort was made to ensure it is full of as
much useful information as possible. Please enjoy!


Chapter 1: Basics of machine learning 
Nowadays machine learning influences every aspect of our lives. Marketing,
health care, social media, banking systems, the stock market and many other
systems use machine learning to target clients, provide customized
products, detect diseases, anomalies and fraudulent transactions, and grow
and improve business. Machine learning relies on the concept of developing
cost-effective programs that learn from data in order to identify trends
and make optimal decisions with minimal human intervention. The powerful
computer systems and the digitized data available in this era have made
machine learning a powerful tool for solving complex problems efficiently
and at low cost.


In this chapter, we will first explain the concept of machine learning in
depth. We will then identify the cases in which to use machine learning and
the cases in which simpler deterministic approaches are the go-to way to
solve the problem at hand. Finally, we will explain the advantages of
machine learning and provide examples of problems that it can solve. Let’s
start by understanding what machine learning is.


What is machine learning?


Machine learning is a whole branch of artificial intelligence and computer
science. Artificial intelligence is an emerging science that builds
programs able to mimic human behavior and make decisions as efficiently as
a human would. It relies on several algorithms to build intelligent systems
that require no human guidance to make predictions and decisions.


The idea is that, instead of developing algorithms with imposed rules to
perform a specific task, we develop algorithms that learn by themselves to
perform this task. These algorithms are what we call machine learning.
Their role is to acquire as much knowledge as possible from data in order
to identify trends and make predictions or decisions based on the learned
knowledge. Machine learning combines statistical methods and the power of
computers to detect hidden patterns and behaviors in order to make
predictions.


For instance, machine learning recommends news, videos, movies and so on,
on YouTube, Netflix and other websites, based on the information gathered
about you when you visited them. These platforms collect as much data about
you as possible to learn your behavior. They then predict what you might
like and provide recommendations based on that. Cloud computing, the
increasing storage capacity of machines and specialized chips have enabled
the major breakthroughs in machine learning.


Machine learning programs are data driven in the sense that they rely
solely on data to make predictions and decisions. They acquire knowledge
only from data and generally improve their performance with more data.
They rely on probabilities rather than logic to map the hidden patterns in
data. A machine learning model often works as a black box that takes a set
of data as input, processes it and produces an outcome. We call it a black
box because we usually do not understand the details of the relationship
that links the outcome produced by the model to the input data.


Machine learning tries to emulate in some way how the human brain works. 


For example, when we play a game, we guess the action most likely to win
the game. When driving somewhere, we try to find the best and fastest route
to our destination. When making a trade, we try to make the trade with the
least risk and the highest return. All these tasks rely on human
intelligence and on decisions that are often made unconsciously. Instead of
analyzing and reasoning, the brain unconsciously tries to estimate how
likely something is to happen. For instance, the stock market is highly
volatile, so logic and reasoning might not be the best basis for a
decision, whereas probabilities based on past experience might be a more
successful strategy.


In the same way, machine learning makes decisions and predictions about how
likely something is to happen based on past experiences. These experiences
are reflected in the collected datasets. Machine learning is very powerful
at predicting how likely something is to happen or to be in a specific
state because machines are very efficient at computing probabilities.


Because powerful computers that can perform fast calculations have emerged,
another type of learning has been developed: deep learning. Deep learning
is a domain of machine learning that consists of breaking a problem into
several sub-problems that are treated separately. Deep learning is mainly
based on artificial neural networks.


Neural networks were developed to mimic the connections and functioning of
the neurons in the human brain. When artificial neural networks are fed
with data, they break the information into chunks, and each chunk is
processed by a neuron. These neurons are connected to each other and pass
information along in order to produce a targeted outcome. For example, when
a person looks at an image of an animal, the human brain processes
information such as color, size and shape, among others, to guess which
animal is in the picture. In the same way, artificial neural networks break
down the same information, based on the pixels of the image, to recognize
the animal in the picture.


Machine learning algorithms are powerful tools as long as the data are
accurate. As mentioned before, machine learning programs are data driven:
they rely on data to make predictions. To obtain a fully working,
well-performing machine learning model that is ready to make predictions
and decisions, the model must be trained and fitted on data.


In other words, we should first train the model on the input data and the
expected outcome. This is usually done with an optimization algorithm that
minimizes the error of the value predicted by the model. We will go into
the details of the process of building a machine learning model later in
this chapter. Now that we understand the concept behind machine learning,
let’s learn when machine learning is the go-to method to solve a problem
and when to avoid it.


When to use and when to avoid machine learning?


To get the best outcome and take full advantage of machine learning, it is
very important to understand when to use it and when it is best to avoid
it. Indeed, some problems are best solved with a simple deterministic
rule-based approach.


A rule-based approach is a system that consists of a set of rules imposed
based on the knowledge or expertise of the user. These rules are imposed to
perform a specific task. Such systems are also called expert systems
because they rely mainly on the expertise of the user. They are based on
rules developed using logic and reasoning, unlike machine learning methods,
which rely on probabilities to perform the same task.


Let’s consider, for example, a person who applies for a loan from a bank. A
rule-based system would, for example, set a rule that if a person has an
income under $1000, the requested loan is denied. This person would
therefore be granted the loan only if they have an income over $1000. A
machine learning program would use other information about this person,
estimate how likely the person is to repay the loan, and then decide
whether to grant or deny it.
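
The contrast between the two approaches can be sketched in a few lines of
Python. This is only an illustration: the extra feature (years of
employment), the weights, the bias and the 0.5 threshold are all made-up
assumptions, not values from a real bank or a trained model.

```python
import math

def rule_based_decision(income):
    """Deterministic rule from the text: deny the loan when income is under $1000."""
    return income >= 1000

def learned_decision(income, years_employed, weights, bias, threshold=0.5):
    """Probability-based decision using hypothetical, already-trained parameters."""
    z = bias + weights[0] * income + weights[1] * years_employed
    probability = 1.0 / (1.0 + math.exp(-z))  # estimated chance of repayment
    return probability >= threshold, probability

# Hypothetical applicant: low income but a long employment history.
print(rule_based_decision(900))
print(learned_decision(900, 6, weights=(0.002, 0.5), bias=-3.0))
```

With these assumed parameters the rule denies the applicant, while the
probability-based program grants the loan because the other information
raises the estimated chance of repayment.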


Machine learning is a data-driven approach. Hence it requires large
datasets in order to make accurate decisions. If only a limited dataset is
available to solve a particular problem, it is best to use a deterministic
approach. With a limited dataset, it is hard to train a machine learning
model and generalize it to other similar problems. The model developed in
this case is applicable only to the few data points it was trained on.


Because machine learning methods rely solely on data, and human expertise
or judgment is not taken into consideration, it is the data that dictate
whether the machine learning method will fail or succeed at the task it was
designed for. A machine learning approach works as follows: a modeler
develops a learning algorithm, then feeds the learning algorithm with data
and information.


The algorithm learns by itself from the data with no guidance or human
interference. It is the algorithm that builds the system. If the data
provided to the algorithm are of poor quality and biased, then the system
is also of poor quality and biased. Hence, acquiring and cleaning the right
data is crucial when solving a problem with machine learning. If the data
are biased and noisy, it is better to stick with a traditional method.
Otherwise the machine learning method will memorize the noise and provide
inaccurate results.


Because they are developed to process large datasets, machine learning
methods require significant time, computational resources and storage
capacity. If these resources are limited, then a deterministic method might
be a better approach to adopt.


Machine learning methods come in handy in three scenarios: 1) when human
expertise is too limited to solve a complex problem, 2) when it is
difficult to transform human expertise into a program, or 3) when it is
time consuming to set up a procedure to solve a simple problem, even though
the human expertise exists and could be implemented in a program.


Let’s take the example of filtering emails into inbox, spam and ads. A
human can easily label an email as spam or an ad. However, it would be very
tedious to do the same task for a large number of emails. A machine
learning algorithm can do the same task more efficiently.


Advantages and challenges of machine learning


There are many advantages to using machine learning in cases where it is
suited. First of all, machine learning programs are able to convert data
into valuable information. That is the core idea behind machine learning:
making data reveal hidden information and patterns to improve our
understanding of different phenomena, whether to increase profit or to
advance medical and human understanding.


Machine learning programs can easily identify trends and hidden patterns in
large datasets that would be tedious for human beings to detect. Another
advantage of machine learning is that no human intervention or expertise is
needed to learn from the data. Machine learning programs keep improving
their accuracy and gaining experience as the quantity of data increases. As
long as data keep growing and are fed to the machine learning algorithms,
they keep improving their performance and making more accurate predictions
and decisions.


A major advantage of machine learning algorithms is that they are capable
of handling multi-variable and multi-dimensional datasets.

Despite their advantages and efficiency, machine learning algorithms have
some disadvantages. The main disadvantage is the need to acquire large,
good-quality and unbiased datasets. Implementing a machine learning
approach also brings challenges, namely the ability to interpret the
results and to detect potential errors.


Indeed, machine learning methods are able to extract valuable information
from data. But without appropriate interpretation, that information is not
useful. Also, without the ability to detect potential errors caused by
possible data bias, the information provided by the machine learning
algorithm might be misleading and yield biased results.


Overall, the main challenge in implementing a machine learning algorithm is
acquiring good-quality data and having the capacity to analyze the data and
interpret the results, together with the ability to point to the plausible
sources of error. Another direct challenge is converting raw data into
formatted, usable data that can be processed by machine learning
algorithms.


Machine learning is nowadays one of the major technological advancements
worldwide. It is present in many fields. The availability of large
quantities of data and the increasing performance of computers are the
reasons for the prevalence of machine learning in so many fields. In this
sub-section we will talk about a few fields where machine learning is used.


Machine learning is very handy for data security and anomaly detection in
systems. For instance, in a cloud system, machine learning algorithms
identify patterns in the ways users gain access to the cloud in order to
flag anomalies that could be security flaws. In the same category, fraud
and suspicious transactions in general are better detected with machine
learning algorithms.


The stock market and financial trading now benefit from the power of
machine learning. The reason is that machines are very efficient at
computing probabilities and processing huge quantities of data at high
speed. The stock market is a highly volatile, stochastic market where
trades are made at high speed and in high volume. Humans cannot compete
with machines at this level. Many firms nowadays rely on machines to make
predictions and perform trades.


Healthcare is another field that now uses machine learning methods. They
are mainly used to analyze scans such as MRIs to detect diseases. They are
also used to learn the causes or risk factors that lead to a certain
disease.


Marketing and e-commerce are among the fields that benefit the most from
machine learning methods, which they use to target clients and improve
profits. More precisely, customized marketing is a field that relies
fundamentally on machine learning. It is based on the idea that the better
you know your clients and their behaviors, the better you can provide them
with customized services.


Hence, the more you sell and the more profit you make. These systems gather
information about clients from the websites they visit and the products
they have viewed or bought. Then, the system generates e-mails, ads and/or
coupons for the products they viewed without buying, or for similar
products.


Using the same concept as customized marketing, recommendations on Netflix,
YouTube, Facebook and Amazon, among others, are based on machine learning
algorithms. The algorithms analyze your behaviors and activities to guess
what you might like to buy or watch.


These systems get smarter and more efficient the more data they gather.
Speech and image recognition as well as natural language processing are
also widespread applications of machine learning algorithms.


Now that you understand the philosophy behind machine learning, its
potential applications and the challenges of implementing it, let’s go
through the major steps to develop a machine learning model to solve a
problem.


How to develop a machine learning model?


Before starting any modelling, it is essential to fix the objectives and
the expected outcome of implementing a machine learning model. At this
step, you should set the goals: the key problem to be solved and the
questions you are trying to answer with the machine learning model. You
should first set hypotheses regarding the problem at hand, as well as some
potential strategies to solve it and plausible inputs to feed the model.


Developing a machine learning model then becomes an iterative process that
involves going through the cycle of setting a hypothesis, testing/training
and validating. Collecting data is a crucial step in developing a machine
learning model. This step implies not only identifying and collecting the
data required to test a hypothesis but also cleaning and formatting these
data into a usable format. At this phase, you should make sure the data are
unbiased and do not include outliers that may affect the results produced
by the model.


Analyzing and visualizing the data helps detect bias and verify the quality
of the data. After collecting and preparing the data, you can build a
model, test it against one set of data and then validate it against new
data. The testing process implies training the model and adjusting the
parameters that affect its accuracy. Validation of the model helps assess
its accuracy based on performance metrics.


At this stage, you assess whether the model produces the expected results
and answers the questions you fixed at the beginning. You can iterate the
process by testing different hypotheses or models until you reach
satisfactory results and the desired outcome.
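
The cycle described above (prepare data, train, validate) can be sketched
as follows. This is a minimal illustration under stated assumptions: the
dataset is a made-up toy example, the "model" is a single threshold on one
input, and the function names and the 25% validation fraction are
hypothetical choices rather than a prescribed recipe.

```python
import random

def split_data(samples, validation_fraction=0.25, seed=0):
    """Shuffle the samples and hold out a fraction of them for validation."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]

def accuracy(threshold, samples):
    """Performance metric: fraction of samples where `x >= threshold` matches the label."""
    correct = sum(1 for x, label in samples if (x >= threshold) == bool(label))
    return correct / len(samples)

def train_threshold(training_samples):
    """Training: pick the candidate threshold with the best training accuracy."""
    candidates = sorted(x for x, _ in training_samples)
    return max(candidates, key=lambda t: accuracy(t, training_samples))

# Toy dataset (made up): the label is 1 when x is 5 or more.
data = [(x, 1 if x >= 5 else 0) for x in range(10)] * 2
training, validation = split_data(data)
threshold = train_threshold(training)
validation_accuracy = accuracy(threshold, validation)
print(threshold, validation_accuracy)
```

If the validation accuracy is not satisfactory, you would go back, revise
the hypothesis or the model, and repeat the same steps.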


In the next chapters of the book, we will learn about the different types
of machine learning and when they are applicable. We will also expand on
how to develop a machine learning model and present the tools to test and
validate it.


Chapter 2: Machine learning types 





There are different types of machine learning models, which can be grouped
into four categories: supervised, unsupervised, semi-supervised and
reinforcement learning.


2.1 Supervised Learning 


Supervised learning is a machine learning paradigm whose goal is to
estimate a function that relates inputs to outputs. This function is called
the mapping function. The mapping function describes the hidden patterns in
the data. In supervised learning we typically have a labeled dataset that
consists of pairs of input data (usually a vector) and a target output.


The target output is a value associated with the input data. The supervised
learning algorithm processes the labeled training dataset in order to
estimate a mapping function that is then used to predict the outputs of
similar new data. In other words, the learned function should generalize to
new unseen examples with acceptable accuracy.


Let’s consider a vector X of input data and a target output value Y.
Formally, the supervised learning algorithm tries to estimate the mapping
function f such that f(X) = Y. Realistically, we are looking for f such
that Y = f(X) + ε, where ε is a random error with mean zero. We want ε to
be as small as possible.


Two main tasks can be performed with supervised learning: classification
and regression. Classification is used when the output is a discrete
variable. A discrete variable can be a qualitative or categorical variable,
for example man or woman, smoker or non-smoker. In contrast, regression is
used when the output variable is continuous (i.e. a quantitative variable),
like the price of a house, the number of clients or age.


A linear regression model is the simplest approach to model a regression 
problem. Regarding classification, a number of different approaches can be 
considered like logistic regression, decision tree, random forest or 
multilayer perceptron. 


The latter is a type of artificial neural network that will be covered in
the next chapter of this book. In this section we will cover the basic
algorithms used for classification and regression, which are logistic
regression and linear regression.


Logistic regression 


Logistic regression has the goal of predicting discrete variables,
typically binary variables (i.e. 0 or 1, yes or no). Given a set of input
data X, we are looking for a function f that estimates the target value Y
such that Y = f(X). In a logistic regression problem, f is the logistic
function, also called the sigmoid function. The logistic function is given
by the following equation: F(X) = L / (1 + exp(-X)). The sigmoid function
is the logistic function with L equal to 1. It maps any real value into the
range between 0 and 1. This is a desirable characteristic in machine
learning because the output can be read as the probability that an input
belongs to a specific class.
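
As a small illustration, the logistic function above can be written in
Python as follows. The function names are our own, and the split into two
branches is only an implementation detail to avoid numerical overflow for
large negative inputs.

```python
import math

def sigmoid(x):
    """Sigmoid function, i.e. the logistic function with L = 1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, exp(-x) can overflow, so use the equivalent form below.
    e = math.exp(x)
    return e / (1.0 + e)

def logistic(x, L=1.0):
    """General logistic function F(x) = L / (1 + exp(-x))."""
    return L * sigmoid(x)

for value in (-5.0, 0.0, 5.0):
    print(value, sigmoid(value))
```

Large negative inputs give values close to 0, large positive inputs give
values close to 1, and an input of 0 gives exactly 0.5.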


Logistic regression first combines the inputs X linearly into a new
variable: Z = β0 + β1 * X, where β0 and β1 are the parameters of the model.
The sigmoid function is then applied to Z in order to get values between 0
and 1. To develop a logistic regression model, we need an objective
function, also called a loss or cost function.


The loss or cost function provides a metric to evaluate the performance and
accuracy of the model. In a logistic regression problem this function is
given as follows according to the target value:
Loss function = - log {1 / [1 + exp(-Z)]} if the target value y is equal to 1
Loss function = - log {1 - 1 / [1 + exp(-Z)]} if the target value y is equal to 0


These two equations can be merged into one single equation, averaged over
the n training examples:

Loss function = - (1 / n) * Σ [y * log(h(Z)) + (1 - y) * log(1 - h(Z))]

where h is the sigmoid function given by the equation h(X) = 1 / (1 +
exp(-X)).
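
The merged loss can be computed directly from its definition. The sketch
below assumes a single input feature and a linear transformation with two
parameters (called `beta0` and `beta1` here); the toy data and parameter
values are made up for illustration.

```python
import math

def sigmoid(z):
    """Sigmoid function h(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_loss(inputs, targets, beta0, beta1):
    """Average logistic loss over n examples, following the merged equation."""
    total = 0.0
    for x, y in zip(inputs, targets):
        h = sigmoid(beta0 + beta1 * x)  # predicted probability of class 1
        # Note: this simple version assumes h is never exactly 0 or 1.
        total += y * math.log(h) + (1 - y) * math.log(1 - h)
    return -total / len(inputs)

# Toy data (made up): negative inputs belong to class 0, positive to class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]
print(logistic_loss(xs, ys, beta0=0.0, beta1=0.0))  # uninformative model
print(logistic_loss(xs, ys, beta0=0.0, beta1=1.0))  # better model, lower loss
```

A model that always predicts a probability of 0.5 has a loss of log(2);
parameters that separate the two classes better give a smaller loss.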


Linear regression 


Linear regression aims to provide a linear relationship between inputs and
quantitative outputs. The mapping function of a linear regression is a
linear function. Given a set of input data X, we are looking for a function
f that estimates the target value Y: Y = f(X) = W * X + B, where W and B
are the parameters of the model. The parameter W is called the weight and B
is called the intercept or bias. Training a model is the process of
estimating the parameters W and B. As for logistic regression, we need a
loss or cost function that provides a metric of the model accuracy. In a
linear regression model, the cost function is a function that quantifies
the distance between the actual value of the output and the value estimated
by the model.


Mathematically, the cost function is:

Loss function = (1 / m) * Σ (Y_predicted - Y_target)^2

where Y_predicted and Y_target are the predicted and true output values,
respectively, and m is the number of examples. In this case, the loss
function is nothing other than the Mean Squared Error (MSE) between the
model output and the actual value; its square root is the Root Mean Squared
Error (RMSE).
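
As an illustration, the sketch below evaluates this cost function for a
linear model with one input. The function names and the toy data (which
follow Y = 2 * X + 1 exactly) are assumptions made for the example.

```python
def predict(x, w, b):
    """Linear mapping function f(x) = w * x + b."""
    return w * x + b

def mean_squared_error(inputs, targets, w, b):
    """Average squared difference between predicted and target values."""
    m = len(inputs)
    return sum((predict(x, w, b) - y) ** 2 for x, y in zip(inputs, targets)) / m

# Toy data (made up) generated from Y = 2 * X + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
print(mean_squared_error(xs, ys, w=2.0, b=1.0))  # 0.0: perfect fit
print(mean_squared_error(xs, ys, w=1.0, b=1.0))  # 3.5: worse parameters
```

Training amounts to searching for the values of W and B that make this
number as small as possible.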


The parameters W and B are estimated such that the loss function is as
small as possible. To do so, we use an optimization algorithm. Different
optimization algorithms are available. We will cover them in Chapter 4.


2.2 Unsupervised learning 


Unsupervised learning is mainly used when the data are unlabeled. In other
words, the input data are available but the outcome associated with these
data is completely unknown. Unlike in supervised learning, there is no
correct answer or labeled training dataset to learn from. In this case, the
unsupervised learning algorithm is designed to learn from the data itself,
without additional information or guidance to constrain the output.


Unsupervised learning algorithms rely on similarities, differences and
patterns to learn the hidden structure in the data. Unsupervised learning
can be used for clustering, association and anomaly detection.


Clustering aims at identifying the inherent groupings in data. Association
aims to detect and describe patterns that occur together in a dataset.
Finally, anomaly detection aims at detecting unusual data points within a
dataset.


Clustering 


Clustering is mainly applied to divide data into groups, commonly called
clusters. A cluster is a set of data points sharing similar
characteristics. Clustering is important for determining the inherent
grouping in unlabeled datasets. There is no universal rule on how to
cluster data; the choice depends on the needs of the user.


The data can be grouped so as to define homogeneous groups of typical data.
In contrast, the data can be grouped so as to isolate unusual, atypical
data or outliers. Finding the homogeneous, typical data can be useful for
reducing the data space and describing the patterns of the dominant
clusters. On the other hand, finding the outliers or atypical data can be
very useful for understanding unusual patterns.


Clustering is very handy in different domains: marketing, to group clients
for customized marketing; biology, to classify different species in order
to study them better; planning, to group houses and analyze their flood
risk based on their location; and insurance and banking, to classify
customers, their policies and their transactions, and to detect fraudulent
transactions.


Usually, clustering algorithms use a similarity measure to identify
clusters in the data. The similarity measure can be defined as a Euclidean
distance or a probabilistic distance. Several algorithms can be used for
clustering. They can be classified as density-based, hierarchical,
partitioning or grid-based methods. Density-based methods group data based
on their density in the data space. Every dense region is considered a
cluster of similar points, distinct from the surrounding lower-density
regions.


Hierarchical clustering forms new clusters from previously formed clusters.
The algorithm either starts with all data in a single cluster and divides
it into several clusters, or starts with each data point as its own cluster
and merges them into larger clusters iteratively. Partitioning-based
methods, on the other hand, split the data into k clusters so as to
optimize a criterion reflecting the similarity between the points within
each cluster. Finally, grid-based methods discretize the data space into a
grid structure, and the clustering is performed on the grid. In this book
we will cover the widely used algorithms of each category. We start with
the most popular one, the k-means algorithm.


The k-means algorithm belongs to the partitioning category of clustering
algorithms. It is a centroid-based algorithm, meaning that the data points
are assigned according to their distance to the centroid of a cluster. The
centroid is the center of a cluster. The algorithm starts with k clusters
formed randomly from the data. Every data point is then assigned to the
cluster whose centroid is closest to it.


Then, the centroid of each cluster is recalculated. The algorithm repeats
this procedure iteratively until it reaches a convergence criterion. This
criterion can be a non-significant change in the centroids, a maximum
number of iterations, or a non-significant change in the assignment of data
points to clusters. Note that the number of clusters k is a user-defined
parameter that should be fixed beforehand. This algorithm is widely used
because it is straightforward to implement, simple to understand and
computationally efficient.
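
The procedure can be sketched in plain Python as follows. This is a minimal
illustration, not a production implementation: it assumes the points are
tuples of numbers, initializes the centroids by picking k random data
points (one common choice among several), uses the Euclidean distance, and
stops when no point changes cluster or after a maximum number of
iterations. The function names and toy data are made up for the example.

```python
import math
import random

def closest_centroid(point, centroids):
    """Index of the centroid with the smallest Euclidean distance to the point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def k_means(points, k, max_iterations=100, seed=0):
    """Cluster points into k groups; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Initialization (an assumption): use k randomly chosen data points.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iterations):
        new_assignments = [closest_centroid(p, centroids) for p in points]
        if new_assignments == assignments:
            break  # convergence: no point changed cluster
        assignments = new_assignments
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments

# Toy data (made up): two well-separated groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignments = k_means(points, k=2)
print(centroids, assignments)
```

On this toy dataset the first three points end up in one cluster and the
last three in the other, whichever two points are picked as initial
centroids.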


However, this algorithm suffers from some drawbacks. 


This algorithm is only applicable when the data are continuous and the mean
is defined. Indeed, the cluster centroid is defined as the mean of the data
in the cluster. If the mean is not defined for the data, other functions
should be used to define the centroid. For instance, if the data are
categorical or binary, the centroid can be defined based on the mode, as in
the k-modes algorithm. The mode is the most frequent value. The k-means
algorithm is also sensitive to outliers in the data because it is based on
the mean, which is not a robust statistic.


The mean is highly sensitive and can change significantly when an outlier
exists in the data. Another downside of this algorithm is that it performs
hard assignment, in the sense that each point is assigned to one cluster
and one cluster only. In reality, however, data points might fit in
different clusters, and clusters may overlap. Because of the way it defines
clusters, the k-means algorithm cannot define overlapping clusters.


Gaussian mixture models are a distribution-based method that helps overcome
some of the issues mentioned for the k-means algorithm. A Gaussian mixture
model uses the mean, the covariance and the size (or weight) to define each
cluster. As mentioned, this type of algorithm uses the distribution of the
data to define the clusters within a dataset.


The self-organizing map is a grid-based method that is fully defined as an
artificial neural network. This category of artificial neural network is
different from the artificial neural networks typically used and covered in
the next chapter of this book. The self-organizing map has the goal of
discretizing the data space. It uses competitive learning and a
neighborhood function to preserve the shape of the data that are fed to the
model.


Association 


Association has the goal of describing patterns and data features that
frequently happen together and are correlated. For instance, patients who
have diabetes might also have high blood pressure. In contrast with
clustering, association helps identify the data features that happen
together and maps the relationship between the features. Clustering helps
identify data points that are related and have similar features.


For instance, if we represent the data as a table where each row is a data
point with features given in each column, clustering groups the rows
together (i.e. data points) while association finds hidden relationships
between the columns (i.e. features). Association is very handy in different
domains. In marketing, association provides information about the products
that are usually bought together. Given this information, the products can
be put in the same aisle in the store. Association has different
applications in health care. For example, it can be used to analyze eating
habits in patients with a certain disease, which might help understand
whether certain foods might cause it, or to understand which diseases have
a high probability of occurring together. Association can be used in
genetics to have a better understanding of which genes are triggered
together. It can be applied in city planning to detect busy traffic
intersections. So now how do association rules work?


Association identifies the co-occurrence of data features. It helps answer
the question: if A happens, what is the other feature B that is highly
probable to occur? In short, association detects the hidden if-then
associations in large datasets. These if-then associations are what are
commonly called association rules in machine learning. The two parts of an
association rule are called the antecedent for the if part (i.e. A) and the
consequent for the then part (i.e. B).


Association relies on three metrics to identify the association rules,
namely support, confidence and lift. Support and confidence are metrics
used to measure how strong the relationship between two features, or the
rule (if A then B), is. Lift is a metric that measures how much trust to
place in the rule by comparing the confidence of the rule with the
confidence that would be expected if A and B were independent.


Support is defined from the number of times A and B occur together. In other
words, support is the frequency of A and B occurring together, which can be
estimated as the fraction of events in which both A and B occur. Support
provides the joint probability p (A and B). Confidence, on the other hand,
measures how many times B occurred when A occurred.


In other words, confidence is the fraction of the occurrences of A in which
B also occurs. Confidence then provides the conditional probability of B
knowing that A occurred, which is p (B|A). Overall, support answers the
question of how frequent the rule if A then B is. If it has a high frequency
then the rule is worth considering. Confidence answers the question of how
often the rule if A then B is true. The lift is the ratio of the confidence
of the rule to the support of B alone, that is p (B|A) / p (B). If the lift
value is above 1 then A and B are positively correlated; if it is less than
1 then they are negatively correlated. If the lift is 1 then A and B are not
correlated. In order to use an association rule learning algorithm, a
minimum threshold support value and a minimum threshold confidence value
should be imposed on the algorithm.


The algorithm then provides all association rules that have a support value
equal to or above the minimum support threshold and a confidence value
equal to or above the minimum confidence threshold that were imposed.
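

The three metrics can be computed directly from a list of observations. The
sketch below is only an illustration: the function names support,
confidence and lift, as well as the small list of hypothetical patients, are
assumptions made for this example. Each patient is represented as the set of
diseases that were recorded for that patient.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of p(consequent | antecedent)."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence of the rule divided by the support of the consequent."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

# Hypothetical records: one set of diseases per patient.
patients = [
    {"diabetes", "hypertension"},
    {"diabetes", "hypertension", "alzheimer"},
    {"diabetes"},
    {"hypertension"},
    {"alzheimer"},
]

a, b = {"diabetes"}, {"hypertension"}      # rule: if diabetes then hypertension
rule_support = support(a | b, patients)    # 2 of the 5 patients have both
rule_confidence = confidence(a, b, patients)   # 2 of the 3 diabetic patients
rule_lift = lift(a, b, patients)           # (2/3) / (3/5), slightly above 1
print(rule_support, rule_confidence, rule_lift)
```

Here the lift is slightly above 1, which suggests a weak positive
association between the two diseases in this toy dataset.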


There are different algorithms to use for association rule learning. We
cite Apriori, SETM and AIS. In this book we will focus on the widely used
Apriori algorithm. This algorithm consists of finding all the rules that
have a support greater than or equal to a minimum support threshold. To
illustrate how Apriori works, let’s denote S as the minimum support
threshold, R as an association rule (if A then B) and i as the iteration
number.


The Apriori algorithm starts by generating all sets R having a size equal
to 1. The iteration number i is set to 1. The algorithm discards the sets R
of size i whose support p (R) is less than S. Then it generates all sets of
size i + 1 that include the current sets of size i. The algorithm
increments the iteration number (i = i + 1), discards again the sets whose
support is less than S, and so on until it has found all sets with a
support that is at least S. Now you are probably wondering how the
algorithm generates sets of rules.


The generation of rules relies on two steps. The first step consists of
generating itemsets. Take for example the association rules for patients
with several diseases, where we try to understand, if a patient has a
disease A, what is the other disease B that this patient is highly likely to
get. An itemset in this case would be for instance {diabetes, hypertension,
Alzheimer}. The second step is generating rules from each itemset: {if
diabetes then hypertension and Alzheimer}, {if diabetes and hypertension
then Alzheimer} and so on.


So now if we go back to the Apriori algorithm, we said it starts by
generating all sets having a size equal to 1. This means in our example it
starts by generating itemsets where each disease is an itemset, such that:
{diabetes}, {hypertension}, {Alzheimer}, {dementia}, ... Then it only keeps
the itemsets that are frequent among all itemsets. At this step, only the
itemsets that have a frequency of at least the minimum support threshold
are kept. In other words, the fraction of cases in which the itemset occurs
must be at least the minimum threshold.


So, for instance, if dementia occurs only 5 times in 1000 cases, the
fraction is 5/1000 = 0.005. This fraction is very low, so {dementia} is not
considered a frequent itemset. The Apriori algorithm relies on the
anti-monotone characteristic of support. This concept stipulates that the
frequency of a subset of an itemset is always greater than or equal to the
frequency of the itemset itself. To explain this concept let’s consider
again the patients example.


For instance, if we consider the itemset {diabetes, hypertension}, the
frequency of this itemset is greater than or equal to the frequency of the
itemset {diabetes, hypertension, Alzheimer}. Therefore, if an itemset has a
support value greater than the minimum threshold, all subsets of this
itemset also have a support value greater than the minimum threshold.


We can look at the anti-monotone concept from another angle: if we decrease
the size of an itemset, we conserve or increase the occurrence frequency of
the new itemset. So, after the Apriori algorithm generates all itemsets of
size 1 and drops the non-frequent ones, it generates all possible itemsets
of size 2 formed from the frequent itemsets of size 1. Then it keeps only
the frequent itemsets of size 2, those that have a support value at or
above the minimum threshold. Then it forms itemsets of size 3 and so on,
increasing the size of the itemsets by 1 each time. After generating the
frequent itemsets, the rules are identified.
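

The generation of frequent itemsets described above can be sketched as
follows. This is a simplified illustration of the idea and not an optimized
implementation: the function name frequent_itemsets and the hypothetical
patient records are assumptions made for this example. Candidates of size
i + 1 are built by joining frequent itemsets of size i, and a candidate is
dropped early when one of its subsets is not frequent, which is the
anti-monotone property in action.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Iteration 1: every single item is a candidate itemset of size 1.
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    size = 1
    while candidates:
        # Keep only the candidates that reach the minimum support.
        kept = {c: support(c) for c in candidates if support(c) >= min_support}
        frequent.update(kept)
        size += 1
        # Join frequent itemsets to form candidates with one more item.
        joined = {a | b for a, b in combinations(kept, 2) if len(a | b) == size}
        # Anti-monotone pruning: all subsets of size - 1 must be frequent.
        candidates = [
            c for c in joined
            if all(frozenset(s) in kept for s in combinations(c, size - 1))
        ]
    return frequent

# Hypothetical records: one set of diseases per patient.
patients = [
    {"diabetes", "hypertension"},
    {"diabetes", "hypertension", "alzheimer"},
    {"diabetes", "hypertension"},
    {"hypertension", "alzheimer"},
    {"dementia"},
]
frequent = frequent_itemsets(patients, min_support=0.4)
for itemset, value in frequent.items():
    print(sorted(itemset), value)
```

With a minimum support of 0.4, {dementia} is dropped in the first iteration,
and the itemset of size 3 is never counted because its subset {diabetes,
alzheimer} is not frequent.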


The identification of rules consists of generating all possible
combinations among the items in an itemset. For instance, if {diabetes,
hypertension, Alzheimer} is our itemset, then the possible rules are: {if
diabetes then hypertension and Alzheimer}, {if diabetes and hypertension
then Alzheimer}, {if hypertension then diabetes and Alzheimer}, {if
hypertension and Alzheimer then diabetes}, {if Alzheimer then diabetes and
hypertension}, {if Alzheimer and diabetes then hypertension}. From these
possible rules, the rules with a confidence level that is at least the
minimum confidence threshold are kept.


Confidence has a property similar to the anti-monotone characteristic of
support, but it holds only among rules generated from the same itemset. As
we explained earlier in this section, the confidence of a rule R {if A then
B} is the conditional probability of B knowing A.


This means that, among the rules generated from the same itemset, a rule
whose antecedent is a subset of another rule’s antecedent cannot have a
higher confidence than that other rule.


Let’s take an example to better understand the concept. Let’s say we have
four items A, B, C, D. The confidence of {if A then B, C, D} is lower than
or equal to the confidence of {if A and B then C and D}, which is also lower
than or equal to the confidence of {if A and B and C then D}. In other
words, the rule {if A and B and C then D} has the greatest confidence.


This comes from the fact that all the rules generated from one itemset share
the same support, the support of the whole itemset. The only difference is
the antecedent against which the confidence of the rule is calculated.
Remember that the confidence of a rule {if X then Y} is support (X and Y) /
support (X). If X is formed from few items then its support is high and the
confidence is low. Therefore, the confidence of the rule decreases when
items are removed from X. So, the Apriori algorithm reduces the number of
rules following the same procedure as for the frequent itemsets.


The number of association rules will always depend on the minimum 


thresholds fixed for the support value and for the confidence value. 
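

The rule identification step can be sketched in Python as well. The function
name rules_from_itemset and the hypothetical patient records are assumptions
made for this illustration, and the sketch assumes that the itemset passed
in is frequent, so that every antecedent occurs at least once. Every
non-empty proper subset of the itemset is tried as the antecedent, and the
remaining items form the consequent.

```python
from itertools import combinations

def rules_from_itemset(itemset, transactions, min_confidence):
    def support(items):
        return sum(items <= t for t in transactions) / len(transactions)

    itemset = frozenset(itemset)
    whole = support(itemset)             # shared by all rules of this itemset
    rules = []
    for r in range(1, len(itemset)):     # antecedent sizes 1 .. len - 1
        for antecedent in combinations(sorted(itemset), r):
            antecedent = frozenset(antecedent)
            conf = whole / support(antecedent)
            if conf >= min_confidence:
                rules.append((antecedent, itemset - antecedent, conf))
    return rules

# Hypothetical records: one set of diseases per patient.
patients = [
    {"diabetes", "hypertension"},
    {"diabetes", "hypertension", "alzheimer"},
    {"diabetes", "hypertension"},
    {"hypertension", "alzheimer"},
    {"dementia"},
]
rules = rules_from_itemset({"diabetes", "hypertension"}, patients, min_confidence=0.8)
for antecedent, consequent, conf in rules:
    print("if", sorted(antecedent), "then", sorted(consequent), conf)
```

In this toy dataset every diabetic patient also has hypertension, so the
rule {if diabetes then hypertension} is kept, while the reverse rule has a
confidence of 3/4 and falls below the threshold.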


Anomaly detection 


Anomaly detection, or outlier detection, aims at identifying outliers in a
dataset. Before diving into anomaly detection in machine learning, let’s
first understand what outliers are and why they might be important. Outliers
are simply data points that are far away from the majority of the data
points in the space of the dataset. Basically, outliers can be either
extremely high values or extremely low values. An outlier usually shows a
different pattern or behavior than the rest of the data points in the
dataset. Outliers can often be detected visually.


Examples are a recorded temperature of 20°C during winter in a Nordic
country, or a peak flow occurring during winter. In Nordic countries, the
peak flow is likely to happen during spring. Depending on the context,
outliers can simply be an error in the data records or an error generated
by the processing of raw data. In other contexts, which we will discuss in
this book, the outliers hold valuable information that helps understand
certain patterns.


Let’s take again the example of peak flows. If we detect an abnormally high
spring peak flow during a specific year in a long flow time series covering
several years, we might suspect that this peak flow is an outlier. However,
before jumping to the conclusion that this flow is an error in the flow
records, we might want to check whether an abnormal environmental process,
such as high precipitation combined with snowmelt caused by high
temperatures, has actually caused this high peak. In this case the
detection of outliers may be handy to help understand the rare extreme
weather events that might cause floods. Outliers are detected based on
patterns.


For instance, a fraudulent transaction can be detected based on the amount
of the transaction and the place of the transaction. These patterns are
compared to the previously recorded patterns. Let’s say for example a
credit card or a bank account is stolen. The credit card company or the bank
can detect the anomaly if a transaction occurs that is unusual compared to
the normal usage of the credit card or the bank account by its holder.
Outlier detection is somewhat similar to clustering but it has a different
goal. Instead of finding similar data points, which is the goal of
clustering, anomaly detection aims at detecting the unusual data points in
a dataset.


Anomaly detection is very handy in different domains. For example, anomaly
detection can be used to detect intrusion into a system. The way a user
connects to a system and the network traffic can be treated as patterns.
A pattern that deviates from the normal usage can raise the alert level of
the system and be flagged as an intrusion, which is a form of anomaly
detection. The detection of fraudulent and suspicious transactions in
banking and credit card systems is a direct application of anomaly
detection.


In machine learning, anomaly detection can be classified into three
categories, namely global (or point), contextual and collective. Global
anomaly detection, or point anomaly detection, is the most common type.
This type of anomaly detection aims at identifying data points that are far
from the rest of the data points. However, with this type of anomaly
detection, it is challenging to determine at which distance from the rest of
the data a point should be considered an outlier. There is a lot of
research on this subject.


One way to do so is to consider the distance from the normal or average
value of the data points. However, this may not be the best approach if we
consider several aspects or the context in which the point occurred.
Therefore, contextual anomaly detection can be useful. Indeed, contextual
anomaly detection aims at detecting anomalies that occur under certain
conditions or in a certain context. Let’s consider again our example of
temperature in Nordic countries.


A temperature above or around 20°C during winter in a Nordic country is an
obvious temperature anomaly. However, the same temperature in a southern
country is normal. In this example, the location where the temperature is
observed provides the context needed to consider a temperature record an
anomaly or not. Collective anomaly detection is different from both global
and contextual anomaly detection.


Collective anomaly detection aims at detecting data points that occur
together and, as a group, form an anomaly. These data points are not
necessarily anomalies if they are considered individually, but the fact that
they happened all together is the anomaly. For instance, in the stock
market, suppose the price of a stock remains exactly the same for a long
period of time. It is usual for the price of a stock to be stable, but it is
expected to show some fluctuation over time. The fact that the price does
not change at all over time is the anomaly. Now that we understand anomaly
detection, let’s see what algorithms can be used.


There are numerous methods to perform anomaly detection. The simplest and
most direct method is to use a statistical approach that consists of
detecting points that largely diverge from the rest of the data points. The
k-nearest neighbors algorithm is another method that can be used. The
nearest points are identified according to the Euclidean distance or the
Mahalanobis distance. The k-means algorithm that we explored for clustering
can also be used. For anomaly detection, data points that are not assigned
to clusters formed by similar data points are considered anomalies.
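

As an illustration of the statistical approach, the sketch below flags the
values that lie more than a given number of standard deviations away from
the mean. The function names, the threshold of 2 standard deviations and the
temperature values are all assumptions made for this example. The second
function applies the same test separately within each context, which is one
simple way to express the idea of a contextual anomaly.

```python
from statistics import mean, stdev

def z_score_outliers(values, threshold=2.0):
    """Global detection: values far from the mean of the whole dataset."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

def contextual_outliers(records, threshold=2.0):
    """Contextual detection: `records` is a list of (context, value) pairs,
    and each value is compared only with the values of the same context."""
    by_context = {}
    for context, value in records:
        by_context.setdefault(context, []).append(value)
    flagged = []
    for context, value in records:
        values = by_context[context]
        mu, sigma = mean(values), stdev(values)
        if sigma > 0 and abs(value - mu) / sigma > threshold:
            flagged.append((context, value))
    return flagged

# Hypothetical winter temperatures in degrees Celsius.
nordic = [-5, -3, -4, -6, -2, -4, -5, -3, -4, 20]
southern = [18, 20, 22, 19, 21, 20, 19, 21, 20, 20]

nordic_only = z_score_outliers(nordic)        # 20 stands out among Nordic values
pooled = z_score_outliers(nordic + southern)  # pooled together, nothing stands out
records = [("nordic", t) for t in nordic] + [("southern", t) for t in southern]
in_context = contextual_outliers(records)     # the context reveals the anomaly
print(nordic_only, pooled, in_context)
```

When the two countries are pooled together, a temperature of 20°C looks
ordinary. It is only flagged when it is compared with the other Nordic
winter temperatures. Note that this test relies on the mean and the
standard deviation, which are themselves sensitive to outliers.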


Anomaly detection, as we have explained in this section, is very important
in many different domains. Anomaly detection can be used to verify the
performance of applications. The performance of an application may have a
great impact on the productivity and the income generated by the
application. It is very important to detect inefficiencies and problems in
the application as soon as they happen, in order to react accordingly and
fix the problem before it impacts the productivity and the income.


That is why anomaly detection is important in this case, to help detect
anomalies within the application. Such applications include banking or
trading platforms, for example. The verification of the quality of products
can also benefit from anomaly detection, which helps detect anomalies in a
production system to prevent revenue losses.


Anomaly detection can also be used to improve the user experience or to
detect anomalies in systems that provide user services, such as online
businesses. Such systems should be updated and maintained for an optimal
user experience. Anomaly detection techniques help identify the roots of
problems and therefore shorten the time needed to analyze a problem and
solve it. Overall, anomaly detection provides an automatic tool to perform
real-time verification and detection of system failures.


This allows a quick analysis of the root of problems, which is beneficial to
cut and prevent the losses as well as the damage that might be caused by
these anomalies. The key elements of successful anomaly detection are
having large datasets to learn from, as well as setting the optimal strategy
to combine supervised and unsupervised learning to get the most information
out of the available data.


2.3 Semi-supervised learning 


Semi-supervised learning is a hybrid approach that combines both supervised
and unsupervised machine learning techniques. It uses both labeled and
unlabeled data. Semi-supervised learning is very handy when a mix of
labeled and unlabeled data is available. In real-world applications,
labeled data are not always available. So how does semi-supervised learning
work?


Semi-supervised learning starts first by training the model on the labeled
data. Then, it applies the trained model to the unlabeled data to generate
more labeled data from these unlabeled data. This allows building another
model from the generated labeled datasets.


Pseudo-labeling is the simplest method to do semi-supervised learning. This
process relies on the concept described above. It starts by training the
model on the labeled training dataset. Then it applies the model to predict
the output of the unlabeled data. These predicted outputs are merged with
the labels of the training dataset. The model is then fitted on the newly
formed data to enhance the model.


There are several semi-supervised learning algorithms. We cite the
self-training algorithm, the multi-view algorithms, the graph-based
algorithms and the generative models.


The self-training algorithm follows the same principle we presented before,
where the model is trained on the labeled data and applied to predict the
unlabeled data. Then the algorithm adds the predicted unlabeled data to the
labeled data and repeats the same process. The self-training algorithm is
the simplest and the most straightforward algorithm for semi-supervised
learning.
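

The self-training loop can be sketched with a very simple model. Everything
in the block below is an assumption made for illustration: the model is a
nearest-centroid classifier on one-dimensional data, the function names and
the number of rounds are arbitrary, and all predicted labels are added at
every round, whereas practical variants often keep only the predictions the
model is most confident about.

```python
def fit_centroids(samples, labels):
    """Train a nearest-centroid classifier: one mean value per class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(model, x):
    """Return the class whose centroid is the closest to x."""
    return min(model, key=lambda y: abs(x - model[y]))

def self_training(labeled_x, labeled_y, unlabeled_x, rounds=3):
    # Step 1: train the model on the labeled data only.
    model = fit_centroids(labeled_x, labeled_y)
    for _ in range(rounds):
        # Step 2: predict labels (pseudo-labels) for the unlabeled data.
        pseudo = [predict(model, u) for u in unlabeled_x]
        # Step 3: refit the model on labeled plus pseudo-labeled data.
        model = fit_centroids(list(labeled_x) + list(unlabeled_x),
                              list(labeled_y) + pseudo)
    return model

# Hypothetical data: two labeled points and four unlabeled points.
model = self_training([1.0, 9.0], ["low", "high"], [2.0, 3.0, 8.0, 7.0])
print(model, predict(model, 4.0))
```

With only two labeled points, the class centroids start at 1.0 and 9.0.
After the unlabeled points are pseudo-labeled and merged, the centroids move
to the middle of each group.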


2.4 Reinforcement learning 


Reinforcement learning is a different type of machine learning. This type
of machine learning has different goals than supervised and unsupervised
learning. Reinforcement learning has the goal of optimizing a system in a
certain context. It aims at finding the optimal strategies and actions to
take in a given situation based on feedback, or a reward.


Reinforcement learning is applicable only to certain problems. Games are an
example of these problems. Reinforcement learning can be used to build
winning strategies. The reinforcement learning algorithm tries to identify
the best actions to take depending on several criteria in order to win the
game. We will discuss these criteria later in this section. Now why is
reinforcement learning important in some cases?


Reinforcement learning comes in handy to automate the learning process and
identify the best strategies to take. The system learns and adapts its own
behavior according to the feedback it gets from taking a specific action or
decision. This learning allows building systems with little human expertise
in the domain of the application. It can also be more computationally
efficient than building a model with a series of rules to provide an optimal
solution to the problem. Reinforcement learning can be applied in different
domains. Like we stated before, it can be used to develop a game, or more
precisely logic games like chess or poker. It is also handy in robotics
engineering.


Reinforcement learning can be applied to learn an optimal set of motor
commands to control a robot, or in robot navigation to learn which
behaviors lead to collisions in order to prevent them. Now how do
reinforcement learning algorithms work? Before jumping into that, let’s
define the major components of reinforcement learning.


The reinforcement learning paradigm relies on seven concepts or components,
namely environment, agent, action, state, reward, policy and finally value.
The agent is the core component of reinforcement learning. An agent is the
part of the learning paradigm that takes actions. It can be an algorithm
that models a concept or an algorithm that plays a logic game, for example.
Action refers to the set of all possible moves or decisions that the agent
can take.


These actions are usually pre-defined and the reinforcement learning
algorithm chooses among all these possible actions. If we take again our
example of games, actions can be the directions that a player can take, like
moving right, left, up or down. These actions can also include the pace and
speed of moving, like jumping, moving fast or slow, or a range of speed
values. State is the current condition the agent is in, which can be a
specific location or moment that situates the agent in relation to the other
elements within its environment.


Reward is the feedback that the agent receives from taking an action. The
reward tells whether the action taken by the agent is a success or a
failure. Basically, the reward is a measure of how successful the action
performed by the agent is. The reward is the objective function that the
reinforcement learning algorithm has to optimize. Environment is the work
domain where the agent operates. Generally, the environment is a procedure
that processes the action and the state of the agent. The environment
evaluates the reward according to the state of the agent and the action
taken by the agent. Along with the evaluated reward, the environment
provides the next state of the agent.


Policy is a plan or a scheme that the agent follows to determine the next
action to take considering its current state. Policies assign actions to
the states of the agent in order to map the series of winning or rewarding
actions. Finally, the value is the reward that the agent can expect to
obtain if it takes an action in a specific state. So how do reinforcement
learning algorithms optimize a system relying on these components?


The agent starts by taking an action in a specific state. The environment
evaluates this action and returns a reward and a new state of the agent.
Based on this state and reward, the agent takes another action.


This process is repeated iteratively to allow the agent to take a series of
actions in different states. An optimal series of actions assigned to all
states, which forms a policy, is defined once all the states and actions
have been tried numerous times. Basically, the algorithm tries to identify
the best sequence of actions, according to the state of the agent, that
should be taken in order to maximize the reward.
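

The interaction between the agent and the environment can be sketched on a
toy problem. The block below is an illustration under several assumptions:
the environment is a hypothetical corridor of five cells where the agent is
rewarded for reaching the last cell, the agent explores with purely random
actions during training, and the values are updated with the tabular
Q-learning rule, one of the algorithms named later in this section. The
learning rate alpha, the discount factor gamma and the number of episodes
are arbitrary choices.

```python
import random

N_STATES = 5                      # corridor cells 0..4, cell 4 is the goal
ACTIONS = ("left", "right")

def step(state, action):
    """Environment: returns the next state, the reward and a done flag."""
    next_state = max(0, state - 1) if action == "left" else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def train(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    # Value of taking each action in each state, initialized to zero.
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(ACTIONS)               # random exploration
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            # Move the value toward the reward plus the discounted best next value.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q

q = train()
# The learned policy maps every state to its most valuable action.
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)}
print(policy)
```

After training, the policy chooses "right" in every cell, which is the
shortest way to the reward in this corridor.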


There are two schemes of reinforcement learning. Reinforcement can be
positive or negative. Positive reinforcement aims at increasing the
frequency of an action if a state or a reward occurs every time this action
is taken. Negative reinforcement, on the other hand, aims at decreasing the
frequency of an action if a reward is not received or is avoided.


Two algorithms are widely used for reinforcement learning, namely the
State-Action-Reward-State-Action (SARSA) algorithm and the Q-learning
algorithm. These algorithms usually use a neural network as the agent to
improve their performance. Indeed, neural networks are used to solve a lot
of machine learning problems. In the next chapter we will learn the basics
and fundamentals of artificial neural networks.


Chapter 3: Artificial Neural Networks 


Artificial neural networks are the craze of today’s era. They are widely
used in machine learning and artificial intelligence. Moreover, they have
become an active research domain, with the aim of improving their
performance and applying them to a wide range of applications. So now you
are wondering: what are artificial neural networks? How do they work? How
can you build an artificial neural network to solve a complex problem? We
will answer all these questions in this chapter of the book. Now let’s
first understand what an artificial neural network is and what the main
idea behind using this concept in machine learning is.


3.1 Fundamentals of artificial neural network 


Artificial neural networks are inspired by the functioning of the human
brain. They emulate how humans learn. An artificial neural network is
composed of several neurons that are connected to each other. Each neuron
is responsible for learning a part of the problem and afterwards the neurons
communicate that information to produce a final output. An artificial
neural network works as a black box, meaning that how it learns and how it
produces its output are not fully understood.


Typically, an artificial neural network is composed of an input layer, an
output layer and one or more hidden layers. The input layer constitutes the
input data that are fed to the neural network. The output layer is the
target or the desired output. The hidden layers are where the input data
are processed. Each of these layers is composed of several neurons.


So, what is a neuron in an artificial neural network and how do we build an
artificial neural network from several neurons? How do we connect all
these neurons in layers? In this section we will answer these questions and
we will explain in detail the terminology used in artificial neural
networks and their architecture.


Let’s start first with the definition of the neuron, the core element of the
network. A neuron is a simple mathematical function that sums the weighted
inputs. Let’s say for example we have M inputs of data that form the vector
X = {x_1, x_2, ..., x_M}, where x_i is an element of the data. A neuron is
the linear combination of all inputs given by the equation below:


F (X) = F ({x_1, x_2, ..., x_M}) = w_1 * x_1 + w_2 * x_2 + ... + w_M * x_M,
where w_i is the weight attributed to each input x_i.


Now the input data X can be in the form of a matrix where each x_i is a
vector of data features. In other words, each observation of the input data
is a vector of features. The neuron can be expressed in matrix form such
that: F (X) = W * X, where W is a weight matrix.


So, basically a neuron is a linear function that is fed the input data and
returns a single output value. The weights are the parameters of the
neuron. When we talk about training an artificial neural network, we are
talking about estimating the weight matrix W that provides a highly
accurate output, the one that is the most similar to the desired or
expected output. So, the weights are what differentiate each neuron from
the others. We will see later in this chapter how to define the weights
for each neuron.


Now, the neuron is a linear function of its inputs. But not only that, a
neuron also applies a non-linear function on top of the value returned by
the linear function. This non-linear function is called an activation
function. So, to sum up, a neuron is a weighted sum of the inputs with an
activation function applied to it. Let’s denote the activation function
G. So, our formula for a neuron becomes:

G [F (X)] = G [W * X], where W is the weight matrix and X the inputs.
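

A single neuron can be written in a few lines of Python. The sketch below
follows the formula G [F (X)]: the function linear plays the role of F and
computes the weighted sum, and the activation G is assumed here to be the
sigmoid function, which is only one possible choice. The function names,
weights and inputs are made up for the illustration.

```python
import math

def linear(weights, inputs):
    """F(X): the weighted sum w_1 * x_1 + ... + w_M * x_M."""
    return sum(w * x for w, x in zip(weights, inputs))

def sigmoid(z):
    """One possible activation function G, bounded between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(weights, inputs, activation=sigmoid):
    """G[F(X)]: the activation applied to the weighted sum of the inputs."""
    return activation(linear(weights, inputs))

weights = [0.5, -1.0, 2.0]        # hypothetical weights
inputs = [2.0, 1.0, 0.5]          # hypothetical input values
weighted_sum = linear(weights, inputs)   # 1.0 - 1.0 + 1.0 = 1.0
output = neuron(weights, inputs)         # sigmoid(1.0), roughly 0.73
print(weighted_sum, output)
```

Changing the weights changes the output for the same inputs, which is why
training a network comes down to estimating the weights.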


The activation function determines whether the neuron should be fired (i.e.
activated) or not based on the model predictions. The activation function
is a crucial element of an artificial neural network and should be
carefully chosen. It has a great impact on the output of the artificial
neural network.


We will go into more detail in the next section about the activation
functions that can be used within artificial neural networks and how each
one works. So now the question is: how do we build a network from these
non-linear functions that represent the neurons?


An artificial neural network is simply a mathematical function that connects
all neurons together. So, what do we mean by connecting neurons? Let’s
look at the example of artificial neural network presented in the figure
below:

[Figure: Example of artificial neural network]


In the figure, we see that we have 4 layers. The first and the last are
respectively the input layer and the output layer. Neurons in the input
layer are not mathematical functions like the one we explained before.
These neurons are simply the input data we feed to the artificial neural
network.


The computation starts from hidden layer 1. Looking at the two inner
layers, hidden layers 1 and 2, we can see that the output of each neuron is
fed to all the neurons in the next layer. This is what connecting the
neurons means. The final layer, the output layer, takes as input the values
produced by the previous layer and produces a final output value.


This output neuron is computed using an equation similar to the one we
described before. In the final output layer, we are typically interested in
a probability value between 0 and 1. For example, in a classification
problem, we want the probability that the input belongs to a specific
class.


Hence, we use a non-linear function that is bounded by 0 and 1. In the next
section we will see which activation functions satisfy this condition. In
short, an artificial neural network is a non-linear function that is applied
to a linear function, itself applied to a non-linear function, and so on.
This might seem complicated now. To make it clear, let’s take the example
of the artificial neural network in the figure above and explicitly
understand the output from each layer.


We said the input layer does not contain any neuron in the sense of the
mathematical explanation we provided for neurons. These are just data fed
to the artificial neural network. So, hidden layer 1 computes the linear
combination of the inputs. That is, if we have for example 5 neurons in
hidden layer 1, we have for each neuron G [F (X)] = G [W * X]. Then, in
hidden layer 2, each neuron is fed the output values of all neurons of
hidden layer 1. That is, for each neuron we have: G [F (X_1)] = G (x_11 *
w_11 + x_12 * w_12 + x_13 * w_13 + x_14 * w_14 + x_15 * w_15), where x_1i is
the output from neuron i in hidden layer 1 and w_1i is the weight associated
with it.


The output layer does the same calculation with the outputs of the second
hidden layer. If we have for example 5 neurons in hidden layer 2, the
output is: G2 [F (X_2)] = G2 (x_21 * w_21 + x_22 * w_22 + x_23 * w_23 +
x_24 * w_24 + x_25 * w_25), where x_2i is the output from neuron i in hidden
layer 2, w_2i is the weight associated with it and G2 is the non-linear
activation function.


So, if we write the formula compactly for all layers, we have the final
output as:

Y = G2 [F2 [G1 (F1 (X), ..., F5 (X)), ..., G5 (F1 (X), ..., F5 (X))]]

where G2 and F2 are the non-linear function and the linear function applied
in hidden layer 2, respectively, and G1 and F1 are the non-linear and linear
functions applied in hidden layer 1. F1 and F2 are different because their
weights W are different. The formula given above is for an example with
layers composed of 5 neurons.
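The layer-by-layer computation described above can be sketched in a few lines
of Python. This is only an illustration under stated assumptions: the helper
names (layer, forward), the use of the sigmoid as the activation G in every
layer, the three-value input and all weight values are made up for the
example and are not taken from the figure.

```python
import math

def sigmoid(z):
    """Non-linear activation G, bounded by 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, activation):
    """Apply one layer. Each row of `weights` holds the weights of one neuron."""
    outputs = []
    for neuron_weights in weights:
        # F: the linear part, a weighted sum of the inputs to this neuron
        weighted_sum = sum(x * w for x, w in zip(inputs, neuron_weights))
        # G: the non-linear part applied to that weighted sum
        outputs.append(activation(weighted_sum))
    return outputs

def forward(x, w1, w2, w_out):
    """Input -> hidden layer 1 -> hidden layer 2 -> single output neuron."""
    h1 = layer(x, w1, sigmoid)
    h2 = layer(h1, w2, sigmoid)
    return layer(h2, w_out, sigmoid)[0]

# Hypothetical input and weights (5 neurons in each hidden layer)
x = [0.5, -1.0, 2.0]
w1 = [[0.1 * (i + j) for j in range(3)] for i in range(5)]
w2 = [[0.05 * (i - j) for j in range(5)] for i in range(5)]
w_out = [[0.2, -0.1, 0.3, 0.0, 0.1]]

y = forward(x, w1, w2, w_out)
print(y)  # a value between 0 and 1
```

Because the last activation is a sigmoid, the result can be read as a
probability, which is the property the output layer was said to need.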


To summarize, an artificial neural network computes its output by applying,
at each layer, an activation function to the weighted sum of the inputs to
that layer. In order to define an artificial neural network, we need to
define the number of hidden layers and the number of neurons per hidden
layer.


That is what defines the structure or architecture of the artificial neural
network. We also need to define the activation function and the weights
associated with each neuron in each layer. These weights are learned in the
training process and fixed when the artificial neural network is tested.
Other parameters should be defined in the process of training the artificial
neural network. We will discuss those parameters later in this chapter when
explaining how to train an artificial neural network. Now let’s discover how
to choose and define the activation function, one of the most important
elements in an artificial neural network.


3.2 Activation functions in artificial neural networks


Activation functions are a crucial component of an artificial neural
network: they decide whether or not a neuron is activated. The activation
function evaluates the output of each neuron and decides whether that output
value should be passed on to the other neurons. Recall the definition of a
neuron: a neuron is a linear combination, that is, the weighted sum of the
inputs to that specific neuron.


The neuron does not have any information on how reasonable that value is or
what the acceptable ranges for that value are. The activation function is
the function that provides that information within an artificial neural
network. It checks the output of each neuron of the network. There is a wide
variety of activation functions that can be used.


Among them are the step function, the linear activation function, the
logistic or sigmoid function, the tanh function, the Rectified Linear Unit
function and the softmax function. These are the most commonly used
activation functions. Indeed, a lot of research is devoted to defining the
optimal activation function to implement in artificial neural networks.


The step function is the simplest and most intuitive activation function. As
we explained before, the activation function decides whether or not to
activate a neuron within an artificial neural network. The step function is
based on the simple idea of fixing a threshold. Based on this threshold, the
neuron is active or not. For instance, if the output value of the neuron is
above that fixed threshold then the neuron is activated, and if it is less
than that threshold it is not activated. Mathematically, if we consider that
Y is the output of a neuron, the step function is defined as follows:

F = ‘activate neuron’, or F = 1, if Y is above the threshold.

F = ‘do not activate neuron’, or F = 0, if Y is less than the threshold.
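As a minimal sketch, the step function can be written as below. The default
threshold of 0 and the choice to return 0 when Y is exactly equal to the
threshold are assumptions made for the example; the definition above does
not cover that case.

```python
def step(y, threshold=0.0):
    """Return 1 (activate) if y is above the threshold, otherwise 0."""
    # Assumption: a value exactly equal to the threshold does not activate.
    return 1 if y > threshold else 0

print(step(0.7, threshold=0.5))  # neuron activated
print(step(0.2, threshold=0.5))  # neuron not activated
```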


The step function is very handy for classification problems that expect a
binary output such as Yes or No (i.e. 1 or 0). Although the step function is
straightforward and easy to understand, it has a downside. The main downside
of this function is its inability to distinguish between several classes in
a classification problem.


For example, suppose we are faced with a classification problem where the
data might belong to more than two classes. Because the step function only
takes the values 0 or 1, if more than one neuron is activated in the network
the output will be 1 for every activated neuron, and nothing distinguishes
one from another.

Put simply, it will be hard to tell to which class each input belongs,
because 1 only means that it belongs to some class. In short, the step
function cannot handle multiple output values or classification into more
than two classes.


The linear function is an alternative that, unlike the step function,
provides a wide range of values. The linear function produces an output that
is proportional to the input to the layer. The fact that this function is
linear makes it not very useful in a lot of applications.


First of all, the linear function does not provide any information on which
input has more weight. It does not allow useful backpropagation in the
artificial neural network. In other words, the derivative or gradient of the
linear function is always a constant, given by the weight matrix W.
Therefore, when training the model, we are not able to identify which
changes in the input to the neuron improve the model performance, because
the gradient of the linear function is completely independent of changes in
the input X of the neuron.


Another downside of the linear function is that no matter how many layers we
use in an artificial neural network, they can all be reduced to a single
layer, because a combination of linear functions is a linear function. So
the final output is the same as a linear combination computed by a single
layer, and the in-between layers do not have any utility. This contradicts
the core idea behind artificial neural networks, which is that adding more
hidden layers may improve the performance of the network.
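The collapse of stacked linear layers into a single layer can be checked
numerically. The sketch below uses small hypothetical weight matrices (w1 and
w2 are made up for the example) and shows that applying two linear layers in
sequence gives the same result as applying one layer whose weights are the
product of the two matrices.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]

# Hypothetical weights: 2 inputs -> 3 neurons -> 2 neurons, linear activation
w1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
w2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]]
x = [0.5, -2.0]

two_layers = matvec(w2, matvec(w1, x))     # layer 1, then layer 2
single_layer = matvec(matmul(w2, w1), x)   # one equivalent layer

print(two_layers)
print(single_layer)  # same values as two_layers
```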


Overall, an artificial neural network built with linear activation functions
is nothing more than a linear regression, unable to handle the complex
structure in the data. In fact, almost all real applications or problems are
non-linear. The relationship that relates the inputs to the outputs is
non-linear, and for that we need a non-linear function.


A non-linear function is able to map the hidden complex structures in the
data. Hence, non-linear activation functions are more useful within an
artificial neural network: they allow the network to learn the complex
patterns in the data, in particular in high-dimensional data like images or
audio.


The derivative or gradient of a non-linear function depends on the input X,
so it reflects changes in X. Therefore, non-linear functions allow for
backpropagation in an artificial neural network. They also support several
hidden layers in an artificial neural network.


Several non-linear activation functions can be implemented within an
artificial neural network. In this book we will cover the commonly used
ones, namely the sigmoid function, the tanh function, the softmax function
and the Rectified Linear Unit function.


The sigmoid function is commonly used as an activation function in
artificial neural networks. This function takes values between 0 and 1. The
mathematical formulation of this function is given below:

F(X) = 1 / (1 + exp (-X))


The sigmoid function is the reciprocal of 1 plus the exponential of -X. When
values of X are above 2 or under -2, the output of this function moves close
to the extreme values 1 and 0, respectively. The figure below presents the
curve of the sigmoid function. As we can see from the figure, for all values
above 4 or under -4, the output Y of the sigmoid function is very close to 1
or 0.


The value of the sigmoid function changes very little, almost
insignificantly, for inputs outside the range of -4 to 4. This issue is what
is called the ‘vanishing gradient’. The vanishing gradient means that the
gradient is very small on the extreme parts of the sigmoid function.


The vanishing gradient problem is a major downside of the sigmoid function:
it makes the learning process of an artificial neural network very slow when
the inputs get close to the edges of the function. This has an impact on the
computational efficiency, making training very expensive.


Figure: Sigmoid activation function (vertical axis F(X)).
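To make the vanishing gradient concrete, the short sketch below evaluates the
sigmoid function and its derivative at a few points. It relies on the
standard identity that the derivative of the sigmoid is F(X) * (1 - F(X));
the function names are chosen for the example.

```python
import math

def sigmoid(x):
    """F(X) = 1 / (1 + exp(-X))"""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_gradient(x):
    """Derivative of the sigmoid: F(X) * (1 - F(X))."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [-6.0, -4.0, 0.0, 4.0, 6.0]:
    print(x, sigmoid(x), sigmoid_gradient(x))
```

The gradient is largest at X = 0, where it equals 0.25, and shrinks toward 0
as X moves away from 0 in either direction, which is the behaviour described
above.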


The tanh function has some similarities with the sigmoid function and also
suffers from the same vanishing gradient issue. However, the tanh function
takes values between -1 and 1, with a more pronounced gradient than the
sigmoid function. This function is zero-centered, which makes it very handy
when inputs can be negative, positive or neutral. The tanh function is
expressed mathematically as follows:

F(X) = tanh (X) = [2 / (1 + exp (-2 * X))] - 1
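As a quick check of the formula, the sketch below compares it with the tanh
function from Python’s standard math module. The helper name
tanh_from_formula is made up for the example.

```python
import math

def tanh_from_formula(x):
    """tanh written as [2 / (1 + exp(-2 * X))] - 1."""
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

for x in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    # The two columns agree up to floating-point rounding.
    print(x, tanh_from_formula(x), math.tanh(x))
```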


The softmax function is a different activation function that comes in handy
when we face a classification problem with several classes. This function
computes the probability that an input belongs to a certain class. It takes
the exponential of the output for each category and then divides that value
by the sum over all categories. The results lie between 0 and 1 and add up
to 1, so they provide the probability for the input to belong to each of the
categories.
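A minimal sketch of the softmax computation follows, assuming the usual
definition in which each output is exponentiated and divided by the sum of
the exponentials. The raw scores are made up for the example, and subtracting
the largest score before exponentiating is a common numerical precaution that
does not change the result.

```python
import math

def softmax(scores):
    """Turn a list of raw outputs into probabilities that sum to 1."""
    shift = max(scores)  # avoids overflow in exp; does not change the result
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw outputs for three classes
probabilities = softmax([2.0, 1.0, 0.1])
print(probabilities)       # the largest score gets the largest probability
print(sum(probabilities))  # approximately 1
```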


The Rectified Linear Unit function is conventionally called the ReLU
function. It is the most widely used of the activation functions cited in
this section. This function is computationally efficient and helps overcome
the problems faced when using the sigmoid function or the tanh function
described earlier in this section. The ReLU function has a very simple
mathematical formulation:

F(X) = max (0, X)


The ReLU function is basically the identity function when X is positive and
is 0 when the input X is negative. So, the ReLU function takes values from 0
to +inf and does not have an upper bound. The shape of the ReLU function
might seem similar to the shape of a linear function, but the ReLU function
is non-linear and its gradient depends on the sign of the input, so it
supports backpropagation.


The main drawback of the ReLU function is that its gradient is null for all
input values that are negative. As with the linear function, backpropagation
cannot update a neuron when its input values are negative. The neuron is
only able to learn when its input values are positive. The fact that the
gradient is null when inputs are negative is called the dying ReLU problem.


Other variations of the ReLU have been proposed to prevent the dying ReLU
problem. The leaky ReLU function is one variation we cite in this book. The
leaky ReLU is defined as the identity function if the input values of X are
positive and as 0.1 * X if the values of X are negative. Mathematically,
this function is given by the equation below:

F(X) = max (0.1 * X, X)


The gradient of this function when the input values are negative is 0.1.
That allows for backpropagation in the training process of the artificial
neural network. However, the predictions for negative values may not be very
consistent. Another variation of the ReLU is the parametric ReLU. This
function relies on the same concept as the leaky ReLU. Instead of the fixed
value 0.1, it uses a parameter a (which can also be learned during training)
as the gradient when the input values are negative. The parametric ReLU is
the identity function when the inputs are positive. When the input values
are negative, the parametric ReLU is the parameter a multiplied by the value
of the input. This function is as follows:

F(X) = max (a * X, X)
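The three formulas above translate directly into code. The sketch below is
only an illustration: the function names are chosen for the example, and the
max form of the leaky and parametric ReLU assumes that the slope (0.1 or a)
is not larger than 1.

```python
def relu(x):
    """F(X) = max(0, X)"""
    return max(0.0, x)

def leaky_relu(x, slope=0.1):
    """F(X) = max(0.1 * X, X); the slope 0.1 is the value used in the text."""
    return max(slope * x, x)

def parametric_relu(x, a):
    """F(X) = max(a * X, X), where a is the slope for negative inputs."""
    return max(a * x, x)

for x in [-2.0, 0.0, 3.0]:
    print(x, relu(x), leaky_relu(x), parametric_relu(x, a=0.25))
```

For a negative input, ReLU returns 0 while the leaky and parametric versions
return a small negative value, so their gradient there is not null.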


Other variations of ReLU exist, like the exponential linear unit, or
exponential ReLU. This function follows an exponential curve when the input
values are negative. This might be a positive side of the exponential ReLU
compared to the other variations of the ReLU that we cited: the leaky and
the parametric ReLU functions are linear when inputs are negative.


However, the exponential ReLU saturates when the negative values are very
large in magnitude, which is a downside of this function. Other variations
of the ReLU function exist. They all rely on the same concept, which is
defining a non-null gradient for the function when the input values are
negative.
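The text does not give a formula for the exponential ReLU. A commonly used
definition, assumed here, is F(X) = X for positive X and
F(X) = alpha * (exp (X) - 1) otherwise, where alpha is a parameter that is
set to 1 in this sketch.

```python
import math

def elu(x, alpha=1.0):
    """Exponential linear unit (assumed definition, alpha is hypothetical)."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

for x in [-50.0, -2.0, 0.0, 2.0]:
    print(x, elu(x))
```

For very negative inputs the output flattens out near -alpha, which is the
saturation mentioned above.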


Now you are probably wondering how to choose an activation function to use
within an artificial neural network, given that there is no perfect function
and each one has its pros and cons. In short, there is no definitive rule
for choosing an activation function. It all depends on the problem you are
trying to solve. A better understanding of what to expect from the neural
network and of the problem at hand will guide you towards one activation
function or another.


If the characteristics of the target function to estimate are somewhat known
beforehand, that should help in choosing an activation function. For
example, the sigmoid function or the softmax activation function is a good
choice for classification problems. In general, it is recommended to start
with the ReLU activation function.


Then you can try other activation functions until you reach a satisfactory
neural network. ReLU is applicable to a wide range of applications, and you
can even try defining your own activation function. The key is to go through
a trial and error process in order to build an efficient neural network.
This is an active research area, and there is still no clear guidance on the
subject of choosing an activation function.


However, there is one aspect that is very important to account for when
choosing an activation function, which is sparsity. Sparsity means that the
activation function does not activate all neurons at once. This
characteristic is very desirable in an activation function implemented
within an artificial neural network because it allows the neural network to
learn faster. It also decreases the probability that a trained artificial
neural network is overfitted.


Let’s say for example we have an artificial neural network with numerous
neurons, which is generally the case. If all neurons are activated, then
they are all processed to the final output. In this case, the network
becomes very dense, which decreases the computational efficiency of the
neural network. The ReLU function has the sparsity characteristic, which
makes it very efficient, and that is why it is recommended to try this
function first and then move to another activation function if the results
are not satisfactory.
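Sparsity can be illustrated by counting how many neurons produce exactly 0
for a given set of weighted sums. The values below are made up for the
example; the point is only that ReLU outputs 0 for every non-positive
weighted sum, whereas the sigmoid never outputs exactly 0.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fraction_inactive(outputs):
    """Share of neurons whose output is exactly 0."""
    return sum(1 for o in outputs if o == 0.0) / len(outputs)

# Hypothetical weighted sums arriving at six neurons of one layer
weighted_sums = [-2.0, -0.5, 0.0, 0.3, 1.5, -1.2]

relu_outputs = [relu(z) for z in weighted_sums]
sigmoid_outputs = [sigmoid(z) for z in weighted_sums]

print(fraction_inactive(relu_outputs))     # four of the six neurons are inactive
print(fraction_inactive(sigmoid_outputs))  # no neuron is inactive
```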


The sigmoid and the tanh activation functions are not very efficient because
they tend to activate almost all neurons. Therefore, they are not sparse
functions. Now you have learnt how an artificial neural network works and
what its principal components are. So how can we build a neural network, or
more precisely, how are the layers structured?

To answer these questions, let’s explore in the next section the types of
artificial neural networks that exist and when they are applicable.


There are different types of artificial neural networks to choose from. Each
type has specific properties and a certain level of complexity, which makes
it applicable to certain problems. In this book we will cover the most
commonly used ones, which are feedforward neural networks, recurrent neural
networks, multi-layer neural networks, convolutional neural networks, and
modular neural networks.


Perceptron and feedforward neural networks 


First, let’s start with the original and simplest artificial neural network
developed, which is the perceptron. All types of artificial neural networks
that exist, including the ones we will describe in this book, rely on the
concept of the perceptron. We can view them as several perceptrons with
different characteristics connected together.


The perceptron was first introduced by Frank Rosenblatt in the 1950s. The
perceptron is a single neuron that is fed the input data and applies a step
function. In a simplified way, the perceptron computes the weighted sum of
the input data and returns a value of 0 or 1 depending on the value of the
weighted sum of inputs.


Conventionally, if the weighted sum is negative the perceptron returns 0,
and if the weighted sum is positive the perceptron returns 1. In general,
the perceptron is only applicable to linearly separable binary
classification problems, because it applies a step function that returns
only 1 or 0 and because it is based on a single neuron.
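A perceptron following this description can be sketched as below. The bias
term and the specific weights are assumptions added for the example (the text
only mentions the weighted sum); they are chosen so that the perceptron
reproduces the logical AND of two binary inputs, which is a linearly
separable problem.

```python
def perceptron(inputs, weights, bias=0.0):
    """Return 1 if the weighted sum (plus bias) is positive, otherwise 0."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if weighted_sum > 0 else 0

# Hypothetical weights and bias implementing a logical AND
weights = [1.0, 1.0]
bias = -1.5

for a in [0, 1]:
    for b in [0, 1]:
        print(a, b, perceptron([a, b], weights, bias))
```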


The feedforward neural network builds on the simple concept of the
single-layer perceptron. This type of neural network is among the first
neural networks developed and used. The feedforward neural network, as its
name indicates, propagates the information in a single direction, which is
from the input layer to the output layer.


This procedure of passing the information from the input to the output
through the hidden layers is called forward propagation, and it is based on
the information provided by the activation function. The feedforward neural
network is composed of a hidden layer of several neurons that are fully
connected.


The output of this type of neural network is the value of the activation
function applied to the weighted sum of inputs to the hidden layer. The
backpropagation method is usually used to train this type of neural network.
The logistic or sigmoid function is an activation function that is typically
used within a feedforward neural network.


The feedforward neural network has been developed over the years, and
numerous derivations of this type of neural network exist now. The radial
basis function neural network is one derivation of the feedforward neural
network. This derivation uses the radial basis function as an activation
function instead of the logistic function.


The radial function is a measure of distance to a center. In other words,
the radial function provides the distance of each point to the relative
center of all data points. Unlike the logistic function, which maps inputs
towards the binary values 0 and 1, the radial function provides continuous
values that allow measuring the distance from the desired value.
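The text does not say which radial function is used. One common choice,
assumed here for illustration, is the Gaussian radial basis function
exp (-gamma * d^2), where d is the distance between a point and a center and
gamma is a hypothetical width parameter.

```python
import math

def gaussian_rbf(point, center, gamma=1.0):
    """Gaussian radial basis function (assumed form, gamma is hypothetical)."""
    squared_distance = sum((p - c) ** 2 for p, c in zip(point, center))
    return math.exp(-gamma * squared_distance)

center = [1.0, 2.0]
print(gaussian_rbf([1.0, 2.0], center))  # at the center the value is 1
print(gaussian_rbf([1.5, 2.5], center))  # close to the center: larger value
print(gaussian_rbf([4.0, 6.0], center))  # far from the center: close to 0
```

The output depends only on how far the point is from the center, and it
varies continuously with that distance.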


Typically, the radial basis function neural network has two layers. Neurons
in the inner layer are connected by the radial basis function. Deep
feedforward neural networks are another variation of the feedforward neural
network. This type of artificial neural network is composed of several
hidden layers. A new kind of learning, called deep learning, has emerged
from this type of neural network. Deep learning has become a whole separate
branch and an active research domain in artificial intelligence and machine
learning.


Deep feedforward neural networks yield better results than a basic
feedforward neural network. However, there are some challenges in
implementing a deep neural network, regarding the optimal number of hidden
layers to use as well as the training of the neural network.


Indeed, having more hidden layers is advantageous to get better results, but
the question of how many layers are enough remains. The number of hidden
layers should also be optimized in this type of neural network.


Feedforward neural networks are able to learn from training datasets. They
provide an output depending on the value of the input. In some systems, we
are interested in the value of the output given the value of the input, but
also given the value of the previous output or the value of the previous
input.


For example, in time series, the output at a time step i will depend not
only on the input at time step i but also on the output at time step i-1.
Let’s take another example, speech recognition. To interpret a sequence of
words in a context, a word’s meaning or reference in a sequence will depend
on the previous words in the sequence. This is normally how the human brain
works. When you start interpreting something, your interpretation as you
read along depends on how you interpreted or understood the previous
elements.


You don’t start from nothing at each element. Feedforward neural networks
are unable to emulate this process of learning, simply because the
information is processed in a single direction, which is from the input to
the output. So, there is an obvious need for a neural network structure that
processes the output too. Recurrent neural networks were designed to address
this issue.


Recurrent neural networks 


Recurrent neural networks are a category of artificial neural networks that
take as input time series or any series that are structured within a time
framework or a space framework. Like the feedforward neural network, they
process information from the input layer to the output layer. However, they
also process information from the output layer back to the input layer. That
is, recurrent neural networks process the information in two directions.


They use a memory at each node to save the output that is fed back to the
input layer. So, the output of a recurrent network is always impacted by its
outputs from the past. Recurrent networks learn not only from the input
training set but also from the decisions they made in the past. Through the
memory implemented in the network, they have a state vector that provides a
context to process the input.


This state vector or context is updated according to the input. So similar
inputs may yield different outputs according to the state left by previous
inputs in the sequence. In short, the recurrent neural network maps the
relationship between the output and the input as well as the relationships
among the inputs. So, at each time step, the state of a hidden layer h is
computed as follows:

h_t = G (W * X_t + U * h_(t-1))

where X_t is the input at time step t, W is the same kind of weight matrix
used in the feedforward network as explained before, h_(t-1) is the hidden
state at the previous time step, and U is the weight matrix applied to it.


The G function is an activation function that can be either a sigmoid
activation function or the tanh activation function.
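The recurrence can be sketched as follows, using tanh as the activation G.
The weight matrices W and U, the hidden size of 2, the initial state of
zeros and the input sequence are all hypothetical values chosen for the
example.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def rnn_step(x_t, h_prev, W, U):
    """One time step: new state = tanh(W * x_t + U * h_prev)."""
    wx = matvec(W, x_t)
    uh = matvec(U, h_prev)
    return [math.tanh(a + b) for a, b in zip(wx, uh)]

def run_rnn(sequence, W, U, hidden_size):
    """Process a whole sequence, starting from a state of zeros."""
    h = [0.0] * hidden_size
    states = []
    for x_t in sequence:
        h = rnn_step(x_t, h, W, U)
        states.append(h)
    return states

# Hypothetical weights: 1 input feature, 2 hidden units
W = [[0.5], [-0.3]]
U = [[0.1, 0.2], [0.0, 0.4]]

states = run_rnn([[1.0], [1.0]], W, U, hidden_size=2)
print(states[0])
print(states[1])  # same input as step 0, but a different state
```

The two time steps receive the same input, yet the hidden states differ
because the second step also sees the state left by the first one.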


There are different categories of recurrent neural networks that rely on the
same principle of using information other than the current input to produce
the output. One example is the bidirectional recurrent neural network. This
category of recurrent neural network uses information from later in the
sequence, as well as from earlier, to predict the present output.


Let’s consider speech recognition: it might be better to consider the whole
sequence to provide an interpretation of a word in a sequence. Interpreting
a word based on previous words might leave some ambiguity, but including the
next words in the sequence might remove that ambiguity and give a better
interpretation of the present word. Recursive neural networks are a broader
form of recurrent networks. The shape of this category of neural network is
more like a tree.


The hierarchical tree is formed by the inputs, where each parent node is
connected to child nodes, which in turn lead to other child nodes. Recursive
neural networks are very complex and computationally expensive. Another
variation of recurrent neural networks is the Sequence to Sequence recurrent
neural network. This class of neural network typically uses two recurrent
networks, where the first one saves updates in the hidden state and provides
the final state as its output. This first network is called the encoder.


The second recurrent neural network processes the information provided by
the encoder in order to produce the final output. This neural network is
called the decoder. In short, the encoder recurrent neural network encodes
the context of the inputs and the decoder translates that information into
the final output. Here, the size of the output sequence does not have to be
the same as the size of the input sequence, unlike a basic recurrent neural
network that produces one output for each input step. Long Short-Term Memory
is another class of recurrent neural networks.


The Long Short-Term Memory was introduced by the two researchers Sepp
Hochreiter and Jürgen Schmidhuber to solve the vanishing gradient problem in
recurrent neural networks. Remember, the vanishing gradient happens when the
gradient becomes extremely small or null as it is propagated back through
the iterations, which does not allow updating the parameters and hence does
not allow the model to learn. The Long Short-Term Memory helps preserve the
error so that it can be backpropagated through time as well as through each
layer of the neural network.


As stated by the authors, the Long Short-Term Memory allows the neural
network to keep learning over many time steps, more than 1000. This matters
because machine learning algorithms and artificial neural networks commonly
require a lot of iterations to learn. Unlike basic recurrent neural
networks, the Long Short-Term Memory supports saving information over long
periods of time. The Long Short-Term Memory can be applied to the prediction
of time series or to classification based on time series datasets.


The Long Short-Term Memory has a different structure than the other
recurrent neural networks. It has a chain structure in which each unit
contains 4 neural network layers, with cells that are called memory blocks.
The Long Short-Term Memory also uses the concept of gates to manipulate the
memory. The gates serve to control the information contained in the cell, so
we call it a gated cell. The Long Short-Term Memory uses three types of
gates that allow saving, writing or reading information from a cell.


These gates act like neurons, meaning they multiply inputs by weights and
apply a sigmoid or tanh function to make decisions about saving, removing or
reading the information according to the strength of the information. The
weights, as for neurons, are adjusted during the learning process of the
network. Therefore, cells also learn when to store information, read out the
information or delete the information during the training of the neural
network by backpropagation.


The three types of gates that are used to control the information that flows
in and out of a cell are the forget gate, the input gate and the output
gate. The forget gate, as its name indicates, removes the information which
is not needed or not useful any more in the cell. This gate takes two
inputs: X_t, the input at time step t, and H_(t-1), the previous output of
the cell.


These inputs are fed to the forget gate weighted by a parameter matrix. An
activation function returning values between 0 and 1, like the sigmoid
activation function, is applied to the resulting value. If the value is
close to 0 then the information is removed, and if the value is close to 1
then the information is kept for eventual use in the following iterations.
In fact, an open forget gate can be viewed as a linear identity function.


The reason is that when the gate is open, the information is carried forward
to the next iteration or time step by multiplying the current memory cell
state by 1. The second type of gate is the input gate. This gate adds
helpful information to the cell state. As with the forget gate, the input
X_t at the current step and the previous output of the cell H_(t-1) are fed
to the input gate.


Then the sigmoid activation function is applied to the value and the
information to be saved is selected. Afterwards, the tanh function is
applied to create all plausible values out of H_(t-1) and X_t. These
plausible values are stored in a vector as values ranging from -1 to 1.


Finally, these plausible values are multiplied by the regulated values from
the sigmoid in order to keep only the helpful information. The third type of
gate used by the Long Short-Term Memory is the output gate. This gate is
responsible for extracting the helpful information that should be passed on
from the current state of the cell. Using the tanh function on the current
state of the cell, this gate generates a vector.


This vector is then filtered via the sigmoid function, which takes as input
the current input X_t and the previous output of the cell H_(t-1), to decide
which values should be retained. Finally, the vector values and the output
values from the sigmoid function are multiplied, and the result is sent as
an output and fed to the following cell. So how do these gates communicate
and work within a cell of the Long Short-Term Memory? The first thing the
Long Short-Term Memory does in a cell at a time step is decide which
information should be dumped from the cell state. So, it applies the forget
gate process.


Then, it decides which new information should be saved in the state of the
cell. In this step the Long Short-Term Memory applies the two-step process
of the input gate. Finally, it decides which information should be passed
forward to the output.


At this stage the Long Short-Term Memory applies the process of the output
gate. So, to summarize, the gated cell takes as input the previous output of
the cell H_(t-1) and the current input X_t. These inputs are processed
through the three gates, which are applied in the following order: forget
gate, input gate and output gate. These gates work the same way as neurons
and have their own weights that are adjusted during the learning process.
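The order of operations inside a gated cell can be sketched with scalar
values. This is a simplified illustration of the standard Long Short-Term
Memory update, not a full implementation: each gate has one weight for X_t,
one weight for H_(t-1) and a bias, and all parameter values are hypothetical.
In the example, the forget gate is almost fully open and the input gate
almost fully closed, so the cell state is carried over nearly unchanged.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gate(x_t, h_prev, params, activation):
    """One gate: weight the two inputs, add a bias, apply the activation."""
    w_x, w_h, b = params
    return activation(w_x * x_t + w_h * h_prev + b)

def lstm_step(x_t, h_prev, c_prev, p):
    f = gate(x_t, h_prev, p["forget"], sigmoid)        # what to keep of the old state
    i = gate(x_t, h_prev, p["input"], sigmoid)         # which new values to save
    g = gate(x_t, h_prev, p["candidate"], math.tanh)   # plausible values in (-1, 1)
    c_t = f * c_prev + i * g                           # updated cell state
    o = gate(x_t, h_prev, p["output"], sigmoid)        # which part of the state to emit
    h_t = o * math.tanh(c_t)                           # output fed to the next step
    return h_t, c_t

# Hypothetical parameters: (weight for x_t, weight for h_prev, bias)
params = {
    "forget": (0.0, 0.0, 10.0),     # sigmoid(10) is close to 1: keep the state
    "input": (0.0, 0.0, -10.0),     # sigmoid(-10) is close to 0: add almost nothing
    "candidate": (1.0, 1.0, 0.0),
    "output": (0.0, 0.0, 10.0),
}

h_t, c_t = lstm_step(x_t=1.0, h_prev=0.0, c_prev=0.8, p=params)
print(c_t)  # stays close to the previous cell state 0.8
print(h_t)
```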


There are several variants of the Long Short-Term Memory that were
introduced in different research papers. We cite in this book the gated
recurrent unit, the Long Short-Term Memory with peephole connections, and
depth gated recurrent neural networks. The gated recurrent unit variant does
not use an output gate.


Hence, the information contained in the memory cell is passed to the full
network without a filter at every time step. The Long Short-Term Memory with
peephole connections adds a peephole connection that allows the gates of the
cell to consider the state of the cell.


Other variants of the Long Short-Term Memory couple the forget gate with the
input gate. So, the gated cells of this variant decide at the same time what
information should be dropped and what information should be added, instead
of making these decisions separately. So, the cell adds input to the state
of the cell only if existing information is being forgotten. In the same
way, the cell forgets existing information only if new information is going
to be added in its place.


Convolutional neural networks 


Convolutional neural networks are a branch of artificial neural networks
based on the concept of convolution. Before diving into this type of
artificial neural network, let’s first find out what convolution is.
Mathematically speaking, convolution is an operation that takes two
functions or signals and produces a new signal.


The convolution combines the two functions fed to it by multiplying them
together as one slides over the other and summing the products. In neural
networks, convolution is simply the application of a filter to the input in
order to activate this input. The convolution multiplies the input data by
weights. Now you may think this is the same concept as in a regular neural
network. Indeed, it is similar to the operation done in a regular neural
network. However, the convolution takes a two-dimensional input, and the
weights here are a two-dimensional array.


This array of weights is called a filter. Applying the same filter at every
position across the input produces a map of activations that is commonly
called a feature map. This feature map provides an indication of how strong
a feature is and of its location within the input. Convolutional neural
networks are very handy when it comes to processing images.
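The sliding of a filter over a two-dimensional input can be sketched as
below. The function name convolve2d, the 4 by 4 image and the 2 by 2 filter
are made up for the example. As is usual in convolutional neural networks,
the filter is applied without being flipped and only at positions where it
fits entirely inside the input.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` and return the resulting feature map."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the filter by the patch under it and sum the products
            total = 0
            for a in range(kernel_h):
                for b in range(kernel_w):
                    total += image[i + a][j + b] * kernel[a][b]
            row.append(total)
        feature_map.append(row)
    return feature_map

# Hypothetical grayscale image: dark on the left, bright on the right
image = [[0, 0, 255, 255] for _ in range(4)]
# Hypothetical filter that responds to a dark-to-bright vertical edge
kernel = [[-1, 1],
          [-1, 1]]

for row in convolve2d(image, kernel):
    print(row)
```

The feature map is 0 where the image is uniform and large in the middle
column, where the filter sits on the edge, so it indicates both the presence
and the location of the feature.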


An image is a raster or matrix. We can actually convert that matrix into a
vector and feed it to a regular feedforward neural network for
classification. However, feedforward neural networks are not able to map the
dependencies of space and time in the image. Because convolutional networks
use the convolution concept and apply filters, they are powerful at mapping
the dependencies in an image. The image is fed as a matrix to the
convolutional neural network.


Remember, the convolutional neural network takes a multi-dimensional input.
The convolutional neural network reshapes the image or matrix into a form
that is easier to process by decreasing its size while conserving the major
features, which carry more weight and are crucial to producing accurate
estimations.


The layer where the convolution is done is called the convolutional layer.
The convolution has the goal of detecting the important features in an
image, like edges and colors. A convolutional neural network can contain
several convolutional layers. Now, how do convolutional neural networks
work, and how do they use the convolution operation? We will answer these
questions in detail in the rest of this section.


Convolutional networks perform four major operations: the convolution we
cited before, which is what makes them special; non-linearity, as in all
neural network types; sub sampling, which is called pooling in the
terminology of convolutional neural networks; and finally classification.
These operations are the basic elements of any convolutional network. So,
let’s go through each operation in detail. To do so, we consider here the
example of an image.


Remember that an image can be represented simply by a matrix of values where
each value is a pixel value. Conventionally, a channel refers to a specific
component of an image. Color images have 3 channels, one each for blue,
green and red. These channels can be represented by three 2-dimensional
matrices of pixel values between 0 and 255 stacked on top of each other.


Typically, grayscale images are represented by only one channel because
they contain only shades between black and white. A pixel value of 0
represents black and a pixel value of 255 represents white. A grayscale
image can be represented by a single 2-dimensional matrix. For simplicity,
we will consider a grayscale image to explain how the convolutional network
works. Hence, we consider a 6 by 6 matrix I whose values are between 0 and
255. To apply the convolution operation, we consider a second 4 by 4 matrix
F.


This matrix is what is called the filter or kernel in convolutional
networks. The second matrix slides over the first matrix, which represents
our image, and at each position the two are multiplied element by element
and the products are summed. The resulting matrix is conventionally called
the Activation map or Convolved feature.


For instance, we denote the Activation map matrix as A. The first element
A(1, 1) of the Activation map A is computed as the sum of the element-wise
products of F and I(1:4, 1:4), where I(1:4, 1:4) is the sub-matrix of I that
contains the first four rows and the first four columns. Then, the second
element A(1, 2) is computed in the same way from I(1:4, 2:5) and F.
Basically, the matrix F is sliding by one pixel over the image.
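
The sliding operation described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the function
name convolve2d, the pixel values of I and the values of F are made up for
the example, and, like most deep learning libraries, the sketch does not
flip the filter before sliding it.

```python
def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the activation map."""
    n_rows, n_cols = len(image), len(image[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    activation_map = []
    for top in range(0, n_rows - k_rows + 1, stride):
        row = []
        for left in range(0, n_cols - k_cols + 1, stride):
            # Multiply the filter and the image patch element by element,
            # then sum the products to get one value of the activation map.
            total = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        activation_map.append(row)
    return activation_map


# Hypothetical 6 by 6 grayscale image I (values between 0 and 255).
I = [[(10 * r + c) % 256 for c in range(6)] for r in range(6)]
# Hypothetical 4 by 4 filter F; in a real network these values are learned.
F = [[1 if r == c else 0 for c in range(4)] for r in range(4)]

A = convolve2d(I, F, stride=1)
print(len(A), "by", len(A[0]))  # size of the activation map
for row in A:
    print(row)
```

With a 6 by 6 image and a 4 by 4 filter moved one pixel at a time, the
filter fits in three positions along each direction, so A is a 3 by 3
matrix.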


The number of pixels by which the matrix slides is called the stride. By
now you are probably thinking that different filter values will provide
different Activation maps, and you may be wondering how we can fix the
values of the filter matrix. Indeed, different filter matrices applied to
the same input image will provide different Activation maps. These filters
are adjusted in the training process. In other words, the convolutional
neural network learns the filter values through training.


The challenges in implementing a convolutional neural network include
choosing the number of filters to use and the dimensions of the filter
matrix. Here we just presented an arbitrary filter example. But to apply a
convolutional neural network, the size of the filter should be chosen
wisely. Regarding the number of filters, a higher number of filters means
that more features are extracted from the image, which tends to improve the
ability of the network to recognize images.


The dimension of the Activation Map is dictated by three factors, which
should be fixed prior to performing the convolution. These factors are the
stride, the depth, and the zero-padding. The stride, as we explained before,
is the number of pixels by which the filter matrix slides.


For instance, if the stride is 1, then the filter slides by only one pixel,
as in the example presented above. If the stride is 3, then the filter moves
by 3 pixels each time it slides. As a general rule, the smaller the stride,
the more features are mapped. The depth represents the number of filters
that are used to perform the convolution. Each filter produces a different
Activation Map.


These Activation Maps can be stacked on top of each other to form a stack of
N matrices. The depth hence would be N. The zero-padding adds a buffer of
zeros around the edges of the image so that the convolution operation can
also be performed on the edges. Once the convolution operation is performed,
an activation function is applied. It can be the ReLU, the sigmoid or the
tanh function.
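
As a side note on the three factors above, the width (or height) of one
Activation Map can be computed from the input size, the filter size, the
stride and the zero-padding. The sketch below uses the standard formula for
this; the function name activation_map_size and the parameter names are
assumptions made for the illustration.

```python
def activation_map_size(input_size, filter_size, stride=1, padding=0):
    """Number of positions the filter can take along one dimension."""
    return (input_size + 2 * padding - filter_size) // stride + 1


# 6 by 6 image and 4 by 4 filter, as in the example of this section.
print(activation_map_size(6, 4))             # stride 1, no padding
print(activation_map_size(6, 4, stride=3))   # larger stride, smaller map
print(activation_map_size(6, 4, padding=1))  # one ring of zeros around the image
```

The depth does not appear in the formula because it only sets how many such
maps are produced, one per filter.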


However, the ReLU is generally recommended because it usually gives good
results in practice. The application of the activation function adds the
non-linearity to the convolutional neural network. This step provides a
rectified Activation Map. The third step is the sub sampling or pooling
operation. The sub sampling consists of reducing the dimension of the
Activation Map.


This operation is done by retaining only the most useful information.
Different sub sampling functions can be used, such as the sum, the average
or the max, among others. The sub sampling is done over a pre-fixed spatial
zone that is defined, like a filter, by a size and a stride. Let’s say, for
example, that we do a sub sampling using the average.


We can define the spatial zone as a 2 by 2 matrix. So, we go through each 2
by 2 block of the Rectified Activation Map and take the average value of the
block. For example, say we have a 6 by 6 Rectified Activation Map denoted R.
The sub sampling is performed by computing the average of R(1:2, 1:2),
R(1:2, 3:4) and so on for each 2 by 2 block of the Rectified Activation Map.
This reduces the dimension of the feature map, here from 6 by 6 to 3 by 3.
The sub sampling is applied to all Activation Maps that were produced by the
different filters applied to the input image.
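
The rectification and the average sub sampling described above can be
sketched as follows. The function names relu and average_pool and the
values of the 6 by 6 map A are assumptions chosen for the illustration, not
part of any particular library.

```python
def relu(activation_map):
    """Replace every negative value by 0 (the ReLU activation)."""
    return [[max(0, value) for value in row] for row in activation_map]


def average_pool(feature_map, size=2, stride=2):
    """Average each `size` by `size` block, moving `stride` pixels at a time."""
    pooled = []
    for top in range(0, len(feature_map) - size + 1, stride):
        row = []
        for left in range(0, len(feature_map[0]) - size + 1, stride):
            block = [feature_map[top + i][left + j]
                     for i in range(size) for j in range(size)]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


# Hypothetical 6 by 6 Activation Map containing negative values.
A = [[2 * (r - c) for c in range(6)] for r in range(6)]
R = relu(A)          # Rectified Activation Map
P = average_pool(R)  # sub sampled map, 3 by 3
for row in P:
    print(row)
```

Replacing the average by max(block) or sum(block) in average_pool gives the
max and sum variants mentioned in the text.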


The pooling or sub sampling has several functions: it reduces the dimension
of the features of the image in order to make them easier to manage; it
makes the network stable and invariant to small changes in the input image;
it makes the network less prone to overfitting by reducing the number of
parameters; and it conserves the representation of the image and makes it
easy to detect objects wherever they are placed within the image.


After the convolution operation and the sub sampling, the results of the
sub sampling are output and used for image classification. All the features
resulting from the convolution and the sub sampling are connected to a
multi-layer feedforward network. The SoftMax activation is applied to the
resulting output and produces a vector of values whose sum is equal to 1.
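
To see why the SoftMax output sums to 1, here is a minimal sketch of the
function. The name softmax and the example scores are assumptions for the
illustration; subtracting the maximum score is a common numerical safeguard
and does not change the result.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into values between 0 and 1 that sum to 1."""
    largest = max(scores)
    # Subtracting the largest score keeps exp() from overflowing.
    exps = [math.exp(score - largest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


scores = [2.0, 1.0, 0.1]  # hypothetical outputs of the last layer
probabilities = softmax(scores)
print(probabilities)
print(sum(probabilities))
```

The largest score receives the largest share, which is why the class with
the highest value is taken as the predicted class.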


So, to summarize, the convolution and the sub sampling help extract and
identify the most important features of an image, and the last layer,
defined as a feedforward network, acts as the classifier. The convolutional
neural network is trained similarly to feedforward neural networks, using
feedforward and back propagation. In the convolutional neural network, the
filters are adjusted in addition to the weights.


Modular neural networks 


Modular neural networks combine several neural networks that are connected
to each other. Each neural network is responsible for solving a specific
problem that is part of a larger complex problem. In other words, the
complex problem is subdivided into sub-problems, each handled by one of the
neural networks that form the modular neural network. Hence, the modular
network combines the strengths of several neural networks to tackle a
complex problem. Basically, the idea behind modular neural networks is
‘divide and conquer’.


When you divide a problem, it is easier to solve each sub-problem
separately. Now you have learnt the different types of artificial neural
networks. Let’s find out in the next section what the advantages and
disadvantages, as well as the challenges, of implementing an artificial
neural network are.


3.4 Pros and cons of an artificial neural network


The advantages of using artificial neural networks are the same as the
advantages we listed for machine learning in the previous section. They are
able to handle large datasets and do not require expertise or in-depth
knowledge of the domain they are applied to. They can easily map complex
structures in datasets. In addition, artificial neural networks are very
flexible and can be applied to all kinds of machine learning: supervised,
unsupervised and reinforcement learning.


As we learned from the types of artificial neural networks, the different
architectures of artificial neural networks allow them to be applied to
predicting the evolution of time series data using recurrent neural networks
and Long Short-Term Memory, to simple classification using feedforward
neural networks, or to image recognition using convolutional neural
networks.


Now you might be wondering what the advantage of using artificial neural
networks is over the methods we presented in the previous section, such as
logistic regression or linear regression. The main advantage of artificial
neural networks over logistic and linear regression is that they do not
require any statistical adjustment. In other words, artificial neural
networks are a non-parametric approach. In addition, artificial neural
networks do not require any assumptions regarding the distribution or
structure of the data. Actually, you can consider an artificial neural
network as a regression applied in a multi-dimensional space.


If we go back to the regression example in the previous section, we are
trying to approximate a function f(X) = W * X + B that provides the output
Y. The artificial neural network does the same thing. Each neuron plays the
same role as the function f. Remember that the output of a neuron is the
weighted sum of its inputs, which is exactly the same function as f, with B
as the bias.


Now you might think that an artificial neural network is the same as linear
regression. But don’t forget that artificial neural networks apply an
activation function to this output. That is why artificial neural networks
are able to solve non-linear problems. Moreover, artificial neural networks
allow for parallel processing.
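
The difference between the linear function f and a neuron can be sketched in
a few lines. The names linear, sigmoid and neuron, as well as the example
weights, are assumptions made for the illustration; the sigmoid is used here
only as one possible activation function.

```python
import math


def linear(x, w, b):
    """f(X) = W * X + B: the weighted sum of the inputs plus the bias."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + b


def sigmoid(z):
    """One possible non-linear activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(x, w, b, activation=sigmoid):
    """A neuron: an activation function applied to the linear function f."""
    return activation(linear(x, w, b))


x = [1.0, 2.0]    # hypothetical inputs
w = [0.5, -0.25]  # hypothetical weights
b = 0.0           # bias

print(linear(x, w, b))  # what a linear regression would output
print(neuron(x, w, b))  # what the neuron outputs after the activation
```

Without the activation argument, stacking such neurons would still give a
linear function of the inputs; the non-linearity is what lets the network
go beyond linear regression.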


The major disadvantage of artificial neural networks is that they operate
as a black box. Indeed, they provide good results on the condition that they
are applied correctly, but they cannot tell what the exact relation between
the input and the output is or how they reach a specific decision. That
means you cannot easily interpret the behavior of the neural network.
Artificial neural networks are also computationally expensive and hard to
train.


The challenge of implementing an artificial neural network is deciding on
the number of hidden layers as well as the number of neurons in each hidden
layer. There is no firm guidance regarding the optimal number of hidden
layers or the number of neurons per hidden layer. It takes experience and a
trial-and-error process to find the number of neurons and hidden layers to
use and to define the optimal architecture.


Another challenge is to find the right activation function as well as the
right optimization algorithm to train the artificial neural network. Hence,
there are a lot of choices and elements to adjust in order to find the best
configuration of your artificial neural network.


That might make the process very complex and time consuming. Moreover,
artificial neural networks are generally only effective on large datasets.
If only a small dataset is available, it is better to use an alternative
method. Artificial neural networks are prone to overfitting, meaning that
you can achieve a high performance on the training dataset, but when the
artificial neural network is applied to a similar independent dataset, that
high performance drops drastically.


Because artificial neural networks are better trained on large datasets, the
training process might be very time consuming and computationally
intensive. Parallel computing can be used, but it might be complex to
implement.


3.5 How to define and train an artificial neural network


Let’s first summarize what we have learnt in this chapter about artificial
neural networks and recall all the components of a network. An artificial
neural network is a non-linear function applied to a linear function, which
is a weighted combination of the inputs.


The linear function is what connects the neurons to each other, and the
non-linear function, which is the activation function, is what gives the
artificial neural network its ability to solve complex non-linear problems.
So, in order to define an artificial neural network, we need to define the
number of layers and the number of neurons in each layer. We also need to
choose an activation function.


Once these elements are defined, we can think about the training process of
the artificial neural network. Basically, we need to estimate the optimal
weights of each neuron. The training process is an iterative procedure that
involves running the artificial neural network with some weights assigned
to the neurons and then evaluating the output it provides. Once the output
is evaluated, we correct the weights based on that evaluation. Then we
re-run the artificial neural network with the corrected weights, and so on
until we achieve a good performance.


This process of going back and forth is called feedforward propagation and
back propagation. We will explain these two processes in detail later in
this section. This iterative training procedure requires the definition of
a loss or objective function as well as an optimization algorithm.


We also need three independent datasets to train and test the artificial
neural network. Now that you can see the big picture and what is involved
in defining and training an artificial neural network, let’s dig into each
element. We will start with some rules for building an artificial neural
network.


Rules to follow in building an artificial neural network


The number of neurons per layer and the number of layers are important
elements of an artificial neural network. As we mentioned before, there is
no firm guidance on the optimal dimensions of an artificial neural network
structure. It is something of a black art and depends on the expertise of
the modeler.


However, some rules of thumb have emerged over the years of applying
artificial neural networks. In general, the number of neurons in the hidden
layers should increase if the problem is complex and the relationship
between the input data and the output data is complicated. That is, if the
number of features used to predict the output is high, then the number of
neurons should also be high.


The quantity of data available to train the artificial neural network can
be a good indicator of how many neurons can be used. In fact, the size of
the training dataset provides an upper bound on the number of neurons to be
used. To get an estimate of this maximum number, you can divide the number
of elements in the training dataset by the total number of neurons in the
input and output layers.
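
This rule of thumb can be written as a small helper. The sketch below also
includes the further division by a factor between 5 and 10 that the rule
applies to the ratio; the function name max_hidden_neurons and the example
numbers are assumptions made for the illustration, and the result is only a
rough upper bound, not an optimal value.

```python
def max_hidden_neurons(n_training_samples, n_inputs, n_outputs, factor=5):
    """Rough upper bound on the number of hidden neurons (rule of thumb)."""
    if not 5 <= factor <= 10:
        raise ValueError("the rule of thumb uses a factor between 5 and 10")
    # Training samples divided by input plus output neurons, then by the factor.
    return int(n_training_samples // ((n_inputs + n_outputs) * factor))


# Hypothetical problem: 10,000 training samples, 8 inputs, 2 outputs.
print(max_hidden_neurons(10000, 8, 2, factor=5))
print(max_hidden_neurons(10000, 8, 2, factor=10))
```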


The computed ratio should then be divided by a factor ranging from 5 to 10.
In general, a larger factor is applied for noisy data, since fewer neurons
leave the network less room to fit the noise. Adding hidden layers can
increase the efficiency of the artificial neural network. Basically, if the
problem being solved can be broken down into sub-problems or solved in
several steps, then it is appropriate to add more layers. If the problem
cannot be solved in several steps, then the extra layers might just serve
as memory and add nothing of value to the artificial neural network.


In this case the artificial neural network is highly prone to overfitting,
which makes it inapplicable to data that differ from the training dataset.
Overall, the denser the network, meaning the more neurons and layers it
has, the more likely it is to overfit.


Data for training the artificial neural network 


Data is also an important element that dictates the performance of an
artificial neural network. It is a crucial component in building an
artificial neural network. Let’s assume that you have pre-processed the
data. In other words, let’s assume that you have cleaned the data of any
plausible outliers and converted it into a format that is ready to be used
by the neural network. Typically, you should split the data into three
datasets. The first dataset, which we call the training dataset, is used to
train the model using an optimization algorithm. The optimal parameters of
the neural network, i.e. the weights, are estimated using this training
dataset.


Then, the neural network is applied to predict the output of the second
dataset, which we call the validation dataset. After evaluating the output
on the validation dataset, if the results are not satisfactory, we train
the neural network again after changing some elements, for example the
optimization algorithm or the model structure. Then the neural network is
applied once again to the validation dataset to evaluate its accuracy.


We repeat the process until we reach a satisfactory performance. Note that
in this process of using the training and validation datasets, the training
process is not completely independent from the validation dataset. The
reason is that we adjust some choices in the training process according to
the performance of the model on the validation dataset, although this
dataset is not used explicitly during the optimization. That is why we have
a third dataset that we call the test dataset.


This dataset is used once the artificial neural network is trained, its
parameters and all other elements of the network are fixed, and no further
changes are to be made. Evaluating the predictions of the neural network on
the test dataset should provide an independent and objective assessment of
its accuracy.
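
The three-way split described above can be sketched as follows. The function
name split_dataset and the 70/15/15 proportions are assumptions made for the
illustration; the text does not prescribe specific proportions.

```python
import random


def split_dataset(samples, train_fraction=0.7, validation_fraction=0.15, seed=0):
    """Shuffle the samples and split them into training, validation and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_train = int(len(shuffled) * train_fraction)
    n_validation = int(len(shuffled) * validation_fraction)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]  # whatever remains
    return train, validation, test


data = list(range(100))  # stand-in for 100 pre-processed observations
train, validation, test = split_dataset(data)
print(len(train), len(validation), len(test))
```

Each observation ends up in exactly one of the three datasets, which is what
keeps the test dataset independent of the choices made during training and
validation.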


Loss function 


The loss function, commonly called the objective function in the
optimization domain, is simply a mathematical function that measures how
accurate the predictions of a model are. The most direct way to measure the
efficiency of a model is to compute the difference between the model output
and the actual target output. In short, the loss function reflects the model
error. It evaluates a candidate set of parameters and returns a value that
is the model error with respect to this parameterization.


A widely used loss function is the Mean Squared Error. This loss function
is computed as the mean value of the squared differences between the target
values and the estimated values. If $Y_{act}$ and $Y_{pred}$ are the actual
true values and the predicted values respectively, the Mean Squared Error
is:

$$L = \frac{1}{m} \sum (Y_{act} - Y_{pred})^2$$

where m is the number of observations of the output. The difference between
the actual true value $Y_{act}$ and the predicted output value $Y_{pred}$ is
called the residual. Because each residual is squared, the returned value is
never negative, whether the difference between the target and the estimated
values is positive or negative.


A perfect Mean Squared Error value is 0. This loss function is to be
minimized, and it does not reflect whether the model overestimates or
underestimates the target value. The Mean Squared Error is typically used
when the output values are continuous, as in linear regression. In
artificial neural networks, the Mean Squared Error may cause slow
convergence if the Sigmoid function is used as the activation function.


A variant of the Mean Squared Error is the Mean Squared Logarithmic Error.
What differentiates the Mean Squared Logarithmic Error from the Mean
Squared Error we just described is the application of a logarithm function
to both the predicted value of the output $Y_{pred}$ and the true value of
the output $Y_{act}$. This function is given by the following equation:

$$L = \frac{1}{m} \sum \left[ \log(Y_{act} + 1) - \log(Y_{pred} + 1) \right]^2$$

with m the number of observations of the output.


The Mean Squared Logarithmic Error does not penalize heavily when the
residual between the actual value and the predicted value is very large. It
penalizes the model more when it underestimates the target values than when
it overestimates them. Generally, the Mean Squared Logarithmic Error is
similar to the Mean Squared Error if the values and the residual are small.
If the residual between the actual value $Y_{act}$ and the estimated value
$Y_{pred}$ is very large, or if either the estimated or the target value is
large, the Mean Squared Logarithmic Error is very small, even negligible,
compared to the Mean Squared Error.
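
These properties can be checked numerically. The sketch below implements the
two formulas directly; the function names and the example values are
assumptions chosen for the illustration.

```python
import math


def mean_squared_error(y_act, y_pred):
    """L = (1/m) * sum((Y_act - Y_pred)^2)."""
    m = len(y_act)
    return sum((a - p) ** 2 for a, p in zip(y_act, y_pred)) / m


def mean_squared_log_error(y_act, y_pred):
    """L = (1/m) * sum((log(Y_act + 1) - log(Y_pred + 1))^2)."""
    m = len(y_act)
    return sum((math.log(a + 1) - math.log(p + 1)) ** 2
               for a, p in zip(y_act, y_pred)) / m


y_act = [100.0, 100.0]
y_under = [50.0, 50.0]   # the model underestimates by 50
y_over = [150.0, 150.0]  # the model overestimates by 50

# The Mean Squared Error cannot tell the two cases apart.
print(mean_squared_error(y_act, y_under), mean_squared_error(y_act, y_over))
# The Mean Squared Logarithmic Error penalizes the underestimate more.
print(mean_squared_log_error(y_act, y_under), mean_squared_log_error(y_act, y_over))
```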


Other loss functions used in machine learning and for neural networks are
the L1 norm and the L2 norm. The L1 norm is simply the sum of the absolute
values of the residuals between the predicted values of the output
$Y_{pred}$ and the actual true values of the output $Y_{act}$. The L1 norm
loss function is given by the following equation:

$$L = \sum |Y_{act} - Y_{pred}|$$

The L2 norm loss is similar to the L1 norm, except that it sums the squares
of the residuals. It is also similar to the Mean Squared Error, without the
division by the number of observations of the output m. Mathematically, the
L2 norm loss function is given by the equation:

$$L = \sum (Y_{act} - Y_{pred})^2$$


The Mean Absolute Error is another loss function that has properties similar
to those of the L1 norm. It is computed as the sum of the absolute values
of the residuals between the target and predicted output values, divided by
the number of observations of the output m. Mathematically, the equation of
the Mean Absolute Error is:

$$L = \frac{1}{m} \sum |Y_{act} - Y_{pred}|$$

These loss functions may seem similar; however, they have different
properties.


For instance, the Mean Squared Error makes it easy to calculate the
gradient, unlike the Mean Absolute Error, whose gradient is not defined
when the residual is 0. The L2 norm and the Mean Squared Error penalize
large values heavily, especially when the residual between the target and
the predicted value is large.


The reason for this is that they apply a square to the residuals. In
contrast, the L1 norm and the Mean Absolute Error are less sensitive to
large values because they do not use a square.
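
The effect of the square can be seen on a small example with one badly
predicted observation. The function names and the values below are
assumptions chosen for the illustration.

```python
def l1_loss(y_act, y_pred):
    """L1 norm loss: sum of the absolute residuals."""
    return sum(abs(a - p) for a, p in zip(y_act, y_pred))


def l2_loss(y_act, y_pred):
    """L2 norm loss: sum of the squared residuals."""
    return sum((a - p) ** 2 for a, p in zip(y_act, y_pred))


def mean_absolute_error(y_act, y_pred):
    """L1 norm loss divided by the number of observations m."""
    return l1_loss(y_act, y_pred) / len(y_act)


y_act = [1.0, 2.0, 3.0, 4.0]
y_close = [1.5, 2.5, 3.5, 4.5]     # every residual is 0.5
y_outlier = [1.0, 2.0, 3.0, 14.0]  # a single residual of 10

for name, y_pred in [("close", y_close), ("outlier", y_outlier)]:
    print(name, l1_loss(y_act, y_pred), l2_loss(y_act, y_pred),
          mean_absolute_error(y_act, y_pred))
```

Going from the first set of predictions to the second multiplies the L1
loss by 5 but the L2 loss by 100, which is the sensitivity to large
residuals described above.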


A variant of the Mean Absolute Error is the Mean Absolute Percentage Error.
This loss function, as its name suggests, expresses the model error as a
percentage of the actual true value of the output. The Mean Absolute
Percentage Error is given by the following equation:

$$L = \frac{1}{m} \sum \left| \frac{Y_{act} - Y_{pred}}{Y_{act}} \right| \times 100$$

where m is the number of observations of the output. The Mean Absolute
Percentage Error cannot be used if any of the actual output values is zero.


The reason is that we divide by the actual target value of the output
$Y_{act}$, and we cannot divide by 0. Another downside of the Mean Absolute
Percentage Error is that it has no upper bound when the model overestimates
the target value. In fact, when the residual between the actual and the
predicted output values is small relative to the actual value, the Mean
Absolute Percentage Error is less than 100, but when the predicted outputs
are much larger than the actual values, the Mean Absolute Percentage Error
can exceed 100.
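
Both downsides can be seen in a short sketch of the Mean Absolute Percentage
Error. The function name and the example values are assumptions made for the
illustration; raising an error on a zero actual value is one possible way to
handle that case.

```python
def mean_absolute_percentage_error(y_act, y_pred):
    """L = (1/m) * sum(|(Y_act - Y_pred) / Y_act|) * 100."""
    if any(a == 0 for a in y_act):
        raise ValueError("undefined when an actual value is 0")
    m = len(y_act)
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(y_act, y_pred)) / m


# Predictions within 10 percent of the actual values.
print(mean_absolute_percentage_error([100.0, 200.0], [90.0, 220.0]))
# A large overestimate pushes the error above 100.
print(mean_absolute_percentage_error([10.0], [40.0]))
```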


The Mean Absolute Percentage Error can be useful for comparing different
models with different outputs because it provides a relative error. So far,
we have presented loss functions that can be used for linear regression or
when we are trying to predict a variable that has continuous values. Next,
we will talk about loss functions that can be used when the variable is
binary.


Another widely used loss function is the logistic loss or logarithmic loss.
The logistic loss is also called the Cross Entropy. This loss function is
mainly used when the output values are binary (i.e. 0 or 1) or represent a
probability. Estimated probabilities are compared to the target output
value. Then a score is computed according to the distance between the target
value and the estimated value in order to penalize that probability. In
fact, it applies a logarithmic penalty where a small distance receives a
small score and a large distance receives a high score. The logarithmic loss
is to be minimized. Smaller values represent a smaller model error, and the
closer the value is to 0, the better the model is. Mathematically, the
Cross Entropy or logistic loss function is defined by the equation:

$$L = -\frac{1}{m} \sum \left[ Y_{act} \log(Y_{pred}) + (1 - Y_{act}) \log(1 - Y_{pred}) \right]$$

where m is the number of observations of the output. The Cross Entropy can
also be extended to classification with multiple classes. Another
logarithmic type of loss function that can be used is the Negative
Logarithmic Likelihood.


This loss function is similar to the logistic loss function and is computed
as follows:

$$L = -\frac{1}{m} \sum \log(Y_{pred})$$

where m is the number of observations and $Y_{pred}$ is the probability the
model assigns to the actual class. This function is not typically used to
train other kinds of machine learning models. It is rather used in
artificial neural networks to measure how well a classifier model performs.
The Negative Logarithmic Likelihood is only used when the model outputs the
probability that an input belongs to each class, rather than just an
estimate of the most likely class.
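
The two logarithmic losses can be sketched and compared as follows. The
function names, the small constant eps that keeps the logarithm away from 0,
and the example probabilities are assumptions made for the illustration. The
sketch also assumes that the Negative Logarithmic Likelihood is evaluated on
the probability that the model assigns to the actual class.

```python
import math


def binary_cross_entropy(y_act, y_pred, eps=1e-12):
    """L = -(1/m) * sum(Y_act*log(Y_pred) + (1 - Y_act)*log(1 - Y_pred))."""
    total = 0.0
    for a, p in zip(y_act, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += a * math.log(p) + (1 - a) * math.log(1.0 - p)
    return -total / len(y_act)


def negative_log_likelihood(prob_of_actual_class, eps=1e-12):
    """L = -(1/m) * sum(log(p)), p being the probability given to the actual class."""
    m = len(prob_of_actual_class)
    return -sum(math.log(max(p, eps)) for p in prob_of_actual_class) / m


y_act = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]  # probabilities close to the targets
poor = [0.4, 0.6, 0.3, 0.7]       # probabilities far from the targets

print(binary_cross_entropy(y_act, confident))
print(binary_cross_entropy(y_act, poor))

# Probability that the "confident" model gives to the actual class of each input.
prob_actual = [p if a == 1 else 1.0 - p for a, p in zip(y_act, confident)]
print(negative_log_likelihood(prob_actual))
```

For a binary output, the two functions return the same value when the
Negative Logarithmic Likelihood is computed on the probabilities of the
actual classes, which illustrates how similar the two losses are.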


Another loss function used for classifier models is the Hinge loss
function. This loss function is also called the max-margin objective
function. The hinge loss is computed directly on the raw output of the
classification model (its score or probability), not on the class label.


For an actual target output $Y_{act}$ and a value predicted by the model
$Y_{pred}$, the hinge loss function is given by the following equation:

$$L = \frac{1}{m} \sum \max(0, 1 - Y_{act} \cdot Y_{pred})$$

where m is the number of observations of the output. When the actual and
the predicted values are both positive or both negative (i.e. have the same
sign), the model predicts the correct class. The hinge loss is then small,
and it is exactly 0 once the product $Y_{act} \cdot Y_{pred}$ reaches 1.


When the actual and the predicted output values have opposite signs, which
means the model does not predict the right class, the hinge loss increases.
There is a more general formula of the hinge function, given by the
following equation:

$$L = \frac{1}{m} \sum \max(0, a - Y_{act} \cdot Y_{pred})$$

where a is a user-defined value and m is the number of observations of the
output.


The hinge loss function has a variant called the Squared Hinge loss
function. This loss function squares the max-margin value. The Squared
Hinge loss function is given by the following equation:

$$L = \frac{1}{m} \sum \left( \max(0, a - Y_{act} \cdot Y_{pred}) \right)^2$$
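
The hinge loss and its squared variant can be sketched in one function. The
function name hinge_loss, the option squared and the example scores are
assumptions made for the illustration; the sketch also assumes that the two
classes are encoded as -1 and +1, so that the sign of the product tells
whether the predicted class is right.

```python
def hinge_loss(y_act, y_pred, a=1.0, squared=False):
    """L = (1/m) * sum(max(0, a - Y_act * Y_pred)), optionally squared."""
    total = 0.0
    for target, score in zip(y_act, y_pred):
        margin = max(0.0, a - target * score)
        total += margin ** 2 if squared else margin
    return total / len(y_act)


y_act = [1, -1, 1, -1]          # actual classes encoded as -1 or +1
scores = [2.0, -1.5, 0.5, 1.0]  # hypothetical raw outputs of the model

# The first two observations are right with a comfortable margin: no penalty.
# The third is right but too close to 0; the fourth has the wrong sign.
print(hinge_loss(y_act, scores))
print(hinge_loss(y_act, scores, squared=True))
```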


To train a machine learning model or an artificial neural network, the loss
function should be chosen according to the problem being solved. In this
section we presented different loss functions that can be used. So, if the
model or the artificial neural network is a classifier, the logistic loss
or another loss function that uses a logarithm should be used.


In contrast, if a regression problem is being solved, a loss function like
the Mean Squared Error or the L1 or L2 norm can be used. Keep in mind that
some loss functions, like the L2 norm or the Mean Squared Error, penalize
large values heavily. If we want to emphasize small values, the Mean
Squared Logarithmic Error is better suited than the other loss functions.


Feedforward and back propagation 


The process of feedforward and back propagation was proposed back in the
70s. Since then, it has become the most widely used process for training
artificial neural networks. This process has proven its efficiency in
solving non-linear, ill-defined, complex problems. So how do feedforward and
back propagation work?


The feedforward is the first step. It consists of running the neural
network using the training dataset as input. Basically, by feedforward we
mean passing the training data through the network from the input layer to
the output layer. In this process, all neurons apply their calculations to
transform the information and pass it on, until the output layer produces
the final prediction for the given input. Afterwards, the loss function is
calculated. We want the value of the loss function to be as close as
possible to 0. In other words, we want the predicted output value to be as
similar as possible to the actual target output value. After the evaluation
of the loss function, its value is fed backwards through the neural network
to adjust the weights.


The value of the loss function is passed back from the output layer to the
neurons in the hidden layer connected to it, and so on until the input
layer. The loss is propagated backwards as a signal, where each neuron
receives just a fraction of that signal according to the contribution (i.e.
weight) it had in computing the predicted output.


The process is applied to all layers in the network. After the back
propagation, the weights of each neuron are adjusted according to the
propagated loss signal. Typically, the weights are adjusted using a delta
rule, in which the size of each correction is proportional to that signal.


The weights are adjusted in small increments based on the gradients of the
loss function. This technique is called gradient descent. In this
algorithm, the gradient points in the direction in which the loss increases
fastest, so the weights are moved in the opposite direction in order to
approach a minimum of the loss function. We will explain the gradient
descent algorithm in detail in the next chapter, when we present the
optimization algorithms.
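
To make the alternation between feedforward and back propagation concrete,
here is a deliberately tiny sketch: a single linear neuron trained with the
Mean Squared Error, so that back propagation reduces to one application of
the chain rule. The function name train_linear_neuron, the learning rate,
the number of epochs and the data are assumptions made for the
illustration.

```python
def train_linear_neuron(xs, ys, learning_rate=0.1, epochs=500):
    """Fit y = w * x + b by repeating feedforward, back propagation and update."""
    w, b = 0.0, 0.0
    m = len(xs)
    history = []
    for _ in range(epochs):
        # Feedforward: compute the predictions with the current weights.
        preds = [w * x + b for x in xs]
        # Loss: Mean Squared Error between targets and predictions.
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / m)
        # Back propagation: gradient of the loss with respect to w and b.
        grad_w = sum(-2 * (y - p) * x for x, y, p in zip(xs, ys, preds)) / m
        grad_b = sum(-2 * (y - p) for y, p in zip(ys, preds)) / m
        # Gradient descent: small step in the direction opposite to the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, history


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated by y = 2 * x + 1

w, b, history = train_linear_neuron(xs, ys)
print(w, b)
print(history[0], history[-1])  # loss before and after training
```

In a network with hidden layers the same three steps are repeated, except
that the gradients are passed backwards layer by layer.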


The iterative process of training an artificial neural network


To summarize, the training of an artificial neural network is an iterative
process in which the weights are adjusted using the training dataset. The
process alternates between feedforward propagation and back propagation.


The same training dataset is fed to the neural network several times; at
each iteration the loss function is computed and back propagated and the
weights are adjusted accordingly, until a satisfactory performance is
achieved. In some cases, the neural network does not learn: the algorithm
does not converge and the loss remains very high. In other words, the output
predicted by the neural network diverges greatly from the desired output
values. The first element that should be checked in these cases is the data.
There are two scenarios that might explain why the neural network is unable
to learn from the data.


The first scenario is that there is insufficient information within the
data, or too few data features, for the neural network to learn to predict
the output. The second scenario is that the amount of data is insufficient
for the neural network to learn from.


Hence, special attention should be given to the data in order to train an
effective neural network. Overall, training an artificial neural network is
an optimization problem that requires an objective function or loss function
to optimize and an optimization algorithm to perform the optimization. In
the next chapter we will go through some optimization algorithms that can be
used to train an artificial neural network.


Chapter 4: Learning algorithms 


So far, we have learnt the fundamentals of machine learning and artificial
neural networks. We have also seen the big picture of the process of training
an artificial neural network. In this chapter, we will go into the details of
the training process of a machine learning model and of an artificial neural
network.


In this chapter, we will present in detail how optimization algorithms work
and how they are implemented for training machine learning models. But first,
let’s define some terminology we will be using in this chapter, starting with
the difference between a parameter and a hyperparameter. A parameter is a
configuration that is internal to the model and distinguishes it from other
models. Parameters are estimated from the data. In artificial neural
networks, the weights are the parameters that make each network unique.


Hyperparameters, however, are configurations that are not learned from the
data. They are set by the modeler in order to define and train the model or
to tune the learning algorithm. In an artificial neural network,
hyperparameters include, for example, the number of neurons, the number of
hidden layers, the optimization algorithm and the number of iterations. An
iteration over the whole training dataset is also called an epoch, and the
number of epochs refers to the number of times the artificial neural network
has performed the feedforward and back-propagation processes on the training
dataset in order to learn.


The number of epochs should be increased as long as the error of the
artificial neural network evaluated on the validation dataset keeps
decreasing. Now that we can distinguish between parameters and
hyperparameters, let’s discover what algorithms we can use to train machine
learning models and artificial neural networks.
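As an illustration of how the number of epochs can be chosen from the validation dataset, here is a minimal sketch in Python. The function name best_epoch, the patience argument and the list of validation losses are assumptions made for this example only; they do not come from a specific library.

```python
def best_epoch(validation_losses, patience=2):
    """Return the 1-based epoch with the lowest validation loss,
    stopping once the loss has failed to improve `patience` times in a row."""
    best_loss = float("inf")
    best = 0
    waited = 0
    for epoch, loss in enumerate(validation_losses, start=1):
        if loss < best_loss:
            # The validation loss is still decreasing: keep training.
            best_loss, best, waited = loss, epoch, 0
        else:
            # No improvement: count how long we have been waiting.
            waited += 1
            if waited >= patience:
                break
    return best


# Hypothetical validation losses recorded after each epoch.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
print(best_epoch(losses))
```

In this made-up run, the validation loss stops improving after the fourth epoch, so training beyond that point would not be useful.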


Optimization algorithms are simply algorithms whose goal is to optimize (i.e.
minimize or maximize) an objective function. There is a variety of
optimization algorithms.


Examples include gradient descent algorithms, genetic algorithms, Newton’s
method, simulated annealing and evolutionary algorithms, among others. In
general, we can distinguish zero-order algorithms, first-order algorithms and
second-order algorithms.


Zero-order algorithms, also called derivative-free algorithms, do not use any
form of the derivatives of the loss function. These algorithms are used when
the derivative of the loss function is very complex to obtain.


First-order algorithms use the gradient, i.e. the first derivative of the
loss function, and second-order algorithms also use the second derivative of
the loss function. The derivatives are computed with respect to the
parameters.


The most widely used algorithm in machine learning is the gradient descent
algorithm, which belongs to the first-order category.


There are several variations of this algorithm. In this chapter, we will go
into the details of applying the gradient descent algorithm as well as some
of its variations.


4.1 Gradient descent algorithm 


Gradient descent is one of the most widely used algorithms in machine
learning because it is efficient and very straightforward. The gradient
descent algorithm, as its name indicates, is based on the gradient of the
loss function. In other words, it is based on the first derivative of the
function. Remember that where the first derivative is null, the function is
at a stationary point, which may be a minimum.


Therefore, gradient descent updates the weight parameters based on the
gradient of the function at each layer, starting from the output layer and
moving back to the input layer. That is the specific reason why activation
functions must have a defined derivative, or gradient, in order to apply
gradient descent.


So, the gradient indicates which direction to follow in order to minimize the
loss function. The parameters are updated according to the negative value of
the gradient. The value of the gradient is multiplied by a hyper-parameter
called the learning rate. At each iteration, the weights are updated until
the minimum is reached.


The learning rate dictates how fast we reach that minimum. Now let’s see how
we would apply this algorithm to a linear regression. Remember that in linear
regression we are trying to approximate a function $f$ such that
$f(X) = W \cdot X + b$, where $W$ is the weights and $b$ is the bias. Let’s
assume that $L$ is the loss function we are trying to optimize. Typically,
the function $L$ is something like

$L = \frac{1}{m} \sum (y_{predicted} - y_{target})^2$

where $m$ is the number of observations, $y_{predicted}$ is estimated by the
model and $y_{target}$ is the target value. The loss function described here
is the Mean Squared Error, known as the MSE, which is widely used in
optimization; its square root is the Root Mean Square Error (RMSE). The loss
function $L$ can be written as:

$L = \frac{1}{m} \sum [(W \cdot X + b) - y_{target}]^2$

We just replaced $y_{predicted}$ by its value computed by the function $f$ we
are trying to estimate. So, first let’s describe the steps of gradient
descent, and then we will go into the details of each step. Let’s assume that
we have a pre-defined learning rate $\alpha$ and a fixed maximum number of
iterations, or epochs, $N$.


The algorithm steps are summarized as follows: 1) it starts by assigning
random values to the weights $W$ and the bias $b$, 2) the algorithm runs the
model with $W$ and $b$, 3) the algorithm evaluates the loss function, 4) the
algorithm calculates the value of the partial derivatives
$\partial L / \partial W$ and $\partial L / \partial b$, 5) the algorithm
updates the weights as $W = W - \alpha \cdot (\partial L / \partial W)$ and
the bias as $b = b - \alpha \cdot (\partial L / \partial b)$. It then goes
back to step 2, until the loss function no longer improves or the maximum
number of iterations is reached.
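The five steps above can be sketched in a few lines of Python for a linear regression with a single input. This is only a minimal illustration under stated assumptions: the function name gradient_descent, the tolerance tol used to decide that the loss no longer improves, and the toy dataset are all made up for this example, and the loss is the average squared error described earlier.

```python
import random


def gradient_descent(xs, ys, alpha=0.05, max_epochs=2000, tol=1e-12, seed=0):
    """Fit y = w * x + b by gradient descent on the average squared error."""
    rng = random.Random(seed)
    # Step 1: random initial values for the weight and the bias.
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    m = len(xs)
    loss = previous_loss = float("inf")
    for _ in range(max_epochs):
        # Step 2: run the model with the current w and b.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Step 3: evaluate the loss function.
        loss = sum(e * e for e in errors) / m
        # Stop when the loss no longer improves.
        if previous_loss - loss < tol:
            break
        previous_loss = loss
        # Step 4: partial derivatives of the loss with respect to w and b.
        grad_w = (2 / m) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2 / m) * sum(errors)
        # Step 5: move against the gradient, scaled by the learning rate.
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b, loss


# Toy data generated from y = 2x + 1 (an assumption for the example).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b, loss = gradient_descent(xs, ys)
print(round(w, 3), round(b, 3))
```

On this noise-free toy dataset the estimated weight and bias end up very close to 2 and 1, the values used to generate the data.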


As you can see from the steps of gradient descent, we have some
hyper-parameters to define in order to apply the algorithm to a learning
process. The first hyper-parameter is the learning rate $\alpha$. The
learning rate should be chosen carefully: if it is set to a large value, the
algorithm may never converge, because it might repeatedly overshoot the
minimum. In contrast, if the learning rate is set to a small value, the
algorithm will converge very slowly and will be computationally inefficient.


To get a better understanding of the impact of the learning rate on the
convergence of the gradient descent algorithm, let’s take an explicit
example. Say the gradient, or first derivative, of the loss function has a
value of 2.3 and the learning rate is 0.1. Then the parameter will be
updated, or more specifically decreased, by 0.23. If the learning rate were
0.01 instead of 0.1, the parameter would be decreased by 0.023. Typically,
the learning rate is set to a value ranging from 0.001 to 0.3. If the
algorithm does not converge and the learning process is unsuccessful, it is
recommended to test smaller learning rate values.

To train an artificial neural network using the gradient descent algorithm,
we apply the same steps. In steps 4 and 5, the gradient is computed and the
weights are updated for each layer, starting from the output layer and moving
back to the input layer, as we explained in the previous chapter for the
back-propagation process.
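To make the effect of the learning rate concrete, the following sketch applies gradient descent to the simple function w squared, whose gradient is 2w and whose minimum is at w = 0. The function name descend and the three learning rates are illustrative assumptions, chosen to show a reasonable value, a value that is too small and a value that is too large.

```python
def descend(alpha, steps=50, w=1.0):
    """Run `steps` gradient descent updates on f(w) = w**2, starting from w."""
    for _ in range(steps):
        w -= alpha * 2 * w  # the gradient of w**2 is 2*w
    return w


good = descend(0.1)       # shrinks steadily toward the minimum at 0
too_small = descend(0.001)  # still far from 0 after 50 steps
too_large = descend(1.1)    # every step overshoots, so |w| keeps growing
print(good, too_small, too_large)
```

With the large learning rate, each update jumps over the minimum and lands further away than before, which is why the algorithm never converges in that case.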


The initialization of the weights is very important when training an
artificial neural network with the gradient descent algorithm. The reason is
that if neurons in the same layer start with the same weights, they will have
the same gradient. Hence, these neurons will capture the same characteristics
during the learning process. Therefore, it is recommended to initialize the
weights randomly.


Gradient descent is a deterministic algorithm, meaning that applying the
algorithm over and over with the same initial weights and fixed
hyper-parameters will provide the same results. The algorithm is applied to
the whole training dataset and uses a fixed learning rate. A better way to do
it is to adjust the learning rate by decreasing its value as we approach the
maximum number of epochs, or simply as we approach a solution.


Therefore, we can use an additional hyper-parameter called the learning rate
decay. This hyper-parameter allows the learning rate to be adjusted during
training. At the very beginning of the learning process, the learning rate is
high, which allows bigger steps when updating the weights.


As the learning process moves forward, the learning rate decreases so that
the algorithm can converge to the minimum, or optimal, value of the loss
function. Now that you understand the general concept of the gradient descent
algorithm, let’s discover the variations of this algorithm: stochastic
gradient descent, batch gradient descent and mini-batch gradient descent.
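Before moving on, here is a small sketch of what a learning rate decay can look like. Many schedules exist; the one below, often called inverse time decay, is only one common choice and is an assumption of this example, as are the function name and the numeric values.

```python
def decayed_learning_rate(initial_rate, decay, epoch):
    """Inverse time decay: the rate shrinks as the epoch number grows."""
    return initial_rate / (1.0 + decay * epoch)


# Learning rate used at epochs 0 to 4, starting from 0.1 with a decay of 0.5.
schedule = [decayed_learning_rate(0.1, 0.5, epoch) for epoch in range(5)]
print(schedule)
```

The first epochs use the largest steps, and the steps get progressively smaller as training approaches a solution.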


4.2 Stochastic, batch and mini-batch gradient descent


First of all, it is worth mentioning that all variations of the gradient
descent algorithm follow the same steps we described in the previous section.
The only differences are how frequently they compute the gradient and update
the parameters, and the quantity of data used for each update.


Stochastic gradient descent is the variation of gradient descent that updates
the parameters most frequently. It updates the parameters for each entry of
the training dataset. In other words, if we have N observations in the
training dataset, the observations are fed one at a time and processed
through the whole neural network.


The weights are updated accordingly after each observation of the training
dataset. This version has some advantages and disadvantages. The good side of
stochastic gradient descent is that it might converge quickly in some cases
and is less prone to getting stuck in a local minimum. However, it may become
computationally expensive because it performs one parameter update for every
observation of the training dataset.


Also, this algorithm does not take advantage of the fast computation offered
by vectorization and matrix operations. Batch gradient descent uses the
opposite concept to stochastic gradient descent. Instead of updating the
weights and the bias after each observation is processed, the weights are
updated after all observations have been processed through the neural network
or the model.


Therefore, at each iteration the whole training dataset is processed through
the model, and the weights are updated once the loss function and the
gradient have been evaluated for the whole training dataset. The downside of
the batch gradient descent algorithm is that it is highly likely to get stuck
in a local minimum. Unlike stochastic gradient descent, this algorithm can
also cause memory problems, because large datasets must be handled in one
single shot.


However, this algorithm provides a steadier descent than stochastic gradient
descent. The third variation, mini-batch gradient descent, is a hybrid that
combines the stochastic and batch gradient descent algorithms.


Mini-batch gradient descent was developed to overcome the drawbacks of both
the stochastic and the batch gradient descent algorithms, namely large memory
requirements and computational inefficiency. It takes advantage of
vectorization and matrix operations while avoiding getting trapped in a local
minimum.


So, the concept of the mini-batch is basically dividing the training dataset
into smaller groups of equal size. Each group is called a batch. The same
principle as batch gradient descent is applied to each batch of the training
dataset. Let’s say, for example, that we have a training dataset of N
observations. The training dataset is subdivided into 5 batches, where each
batch contains N/5 observations, or samples. Then we iterate the gradient
descent over the five batches.
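The following sketch shows one possible way to implement the mini-batch idea for a linear regression with a single input. The names make_batches and minibatch_gradient_descent, the shuffling of the data before each pass, the hyper-parameter values and the toy dataset are assumptions made for this illustration. Setting batch_size to 1 gives stochastic gradient descent, and setting it to the size of the dataset gives batch gradient descent.

```python
import random


def make_batches(data, batch_size, rng):
    """Shuffle the observations and cut them into batches of equal size."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    return [shuffled[i:i + batch_size]
            for i in range(0, len(shuffled), batch_size)]


def minibatch_gradient_descent(data, batch_size, alpha=0.1, epochs=500, seed=0):
    """Fit y = w * x + b, updating the parameters once per batch."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for batch in make_batches(data, batch_size, rng):
            m = len(batch)
            # Gradient of the average squared error over this batch only.
            grad_w = (2 / m) * sum(((w * x + b) - y) * x for x, y in batch)
            grad_b = (2 / m) * sum((w * x + b) - y for x, y in batch)
            w -= alpha * grad_w
            b -= alpha * grad_b
    return w, b


# N = 10 observations generated from y = 2x + 1, split into 5 batches of 2.
data = [(i / 10, 2.0 * (i / 10) + 1.0) for i in range(10)]
w, b = minibatch_gradient_descent(data, batch_size=2)
print(round(w, 3), round(b, 3))
```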


Note that in this section we mentioned that gradient descent is prone to
getting trapped in a local minimum. This typically happens when the loss
function is a non-convex function. Non-convex functions can have several
local minima.


The goal of optimization is to find the global minimum or an approximation of
it. In order to avoid getting trapped in a local minimum, we can repeat the
learning process with gradient descent, choosing different random initial
weights and bias each time. We can also use a hyper-parameter called
momentum.


The momentum is a hyper-parameter that is set to a value between 0 and 1. Its
role is to weight the previously computed updates when defining the step
applied to update the parameters. In other words, the momentum is the
fraction of the past update of the parameters that is added to the current
update of the parameters. So, denoting the momentum hyper-parameter as
$\delta$ and the updates applied at step $t$ as $\Delta W_t$ and
$\Delta b_t$, the parameters are updated as follows:

$\Delta W_t = \delta \cdot \Delta W_{t-1} + \alpha \cdot (\partial L / \partial W)$ and $W_t = W_{t-1} - \Delta W_t$,

and the bias as:

$\Delta b_t = \delta \cdot \Delta b_{t-1} + \alpha \cdot (\partial L / \partial b)$ and $b_t = b_{t-1} - \Delta b_t$.
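The momentum update can be sketched as follows for a single parameter. The function name momentum_descent, the hyper-parameter values and the test function, whose minimum is at w = 3, are assumptions made for the illustration. The variable named update holds the past update, a fraction of which is added to the current gradient step.

```python
def momentum_descent(gradient, w, alpha=0.1, delta=0.9, steps=300):
    """Gradient descent where each update keeps a fraction `delta`
    of the previous update."""
    update = 0.0  # there is no past update before the first step
    for _ in range(steps):
        update = delta * update + alpha * gradient(w)
        w -= update
    return w


# Minimize (w - 3)**2, whose gradient is 2 * (w - 3).
w_min = momentum_descent(lambda w: 2 * (w - 3), w=0.0)
print(round(w_min, 4))
```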


To summarize, implementing any variant of the gradient descent algorithm
comes with some challenges. We have seen that the algorithm can be trapped in
a local minimum; to avoid that, we can use the momentum hyper-parameter. The
convergence of the algorithm is also highly sensitive to the learning rate.


An additional challenge is that the learning rate is the same for all model
parameters. Let’s say, for example, that we have sparse data, or sparse
neurons due to an activation function. Remember that sparsity is a desirable
characteristic in an artificial neural network, in order to avoid having a
dense network. In this case, we would want to update the weights at different
learning rates, so that the parameters of the less frequently occurring
features receive larger updates than those of the frequently occurring
features.


To do so, we can use the Adaptive gradient algorithm within the gradient
descent algorithm. The Adaptive gradient algorithm (Adagrad) adjusts the
learning rate for each parameter: the parameters of the less frequently
occurring features are updated with a high learning rate, and the parameters
of the frequently occurring features with a small learning rate.


The Adaptive gradient algorithm improves the performance of the gradient
descent algorithm. The advantage of using this algorithm is that adjusting
the learning rate manually by trying different values is no longer needed,
because it is done automatically. Formally, the algorithm adapts each
parameter using a different learning rate as follows:

$W_{t+1} = W_t - dL \cdot [\alpha / \sqrt{DL + \epsilon}]$ and $b_{t+1} = b_t - dL \cdot [\alpha / \sqrt{DL + \epsilon}]$

where $dL$ is the derivative of the loss function with respect to the
parameter being updated, $DL$ is the sum of the squares of the derivatives
from the previous iterations with respect to that parameter, and $\epsilon$
is a small constant that avoids division by zero. The downside of this
algorithm is that, over the iterations, the effective learning rate becomes
smaller and smaller, as the denominator keeps increasing with the accumulated
sum of the previous squared gradients. This problem is called the diminishing
learning rate. It eventually causes the algorithm to stop learning. Adadelta
is a variant of the Adaptive gradient algorithm that tries to resolve this
issue.
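A minimal sketch of the Adagrad update for a single parameter is given below, assuming that the accumulated quantity is the running sum of squared gradients. The function name adagrad, the hyper-parameter values and the test function are assumptions for the illustration. The function also records the effective learning rate at each step, to make the diminishing learning rate visible.

```python
import math


def adagrad(gradient, w, alpha=0.5, eps=1e-8, steps=500):
    """Adagrad for one parameter. Returns the final value of the parameter
    and the effective learning rate used at each step."""
    accumulated = 0.0  # running sum of the squared gradients
    rates = []
    for _ in range(steps):
        g = gradient(w)
        accumulated += g * g
        rate = alpha / math.sqrt(accumulated + eps)
        rates.append(rate)
        w -= g * rate
    return w, rates


# Minimize (w - 3)**2, whose gradient is 2 * (w - 3).
w_min, rates = adagrad(lambda w: 2 * (w - 3), w=0.0)
print(round(w_min, 4), rates[0], rates[-1])
```

Because the accumulated sum can only grow, the effective learning rate can only shrink from one step to the next.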


In the Adadelta algorithm, the learning rate is no longer monotonically
reduced. In fact, the Adadelta algorithm considers only a restricted window
of size N of previous gradients when updating the parameters. The value of
the current gradient is also included in the calculation of the fraction by
which the parameter is updated. If $\theta$ denotes, for example, the
parameters we are trying to estimate, they are updated as follows:

$\theta_{t+1} = \theta_t + \Delta\theta_t$ where $\Delta\theta_t = -[RMS(\Delta\theta_{t-1}) / RMS(DL)] \cdot DL$

where $DL$ is the gradient with respect to the parameter $\theta$ at time
step $t$, $\Delta\theta_t$ is the update applied to $\theta$ at time step
$t$, and RMS denotes the root mean square over the window.


4.3 Adam algorithm 


The Adam algorithm is a recent algorithm developed specifically to train
artificial neural networks. It was introduced by two researchers, Diederik
Kingma and Jimmy Ba, in 2015 through a poster entitled Adam: A Method for
Stochastic Optimization. The name of the algorithm originates from Adaptive
Moment estimation.


It has gained popularity in machine learning and has proven more efficient
than the gradient descent algorithm in some studies. The authors of this
algorithm claim that using the Adam algorithm for non-convex optimization,
such as training artificial neural networks, has some benefits.


These benefits are that the algorithm is simple to use, is computationally
efficient, does not require large memory, is well suited to handling large
datasets or optimizing many parameters, is suited for non-stationary
objectives and for problems whose objective function has noisy or sparse
gradients, and requires little adjustment of the hyper-parameters. The Adam
algorithm is still being improved and updated. In this section, we will go
through the basics of this algorithm.


The Adam algorithm shares some similarities with the stochastic gradient
descent algorithm with a momentum parameter included. In addition, the Adam
algorithm adapts the learning rate to each parameter. In other words, the
algorithm uses different learning rates for the different parameters of the
model.


The Adam algorithm uses both the first and second moments of the gradient in
order to adjust the learning rates. In fact, the Adam algorithm, as
introduced by its authors, was developed to take advantage of two algorithms:
the Adaptive Gradient Algorithm and Root Mean Square Propagation. The
Adaptive Gradient Algorithm adjusts the parameters according to the frequency
of occurrence of the features. Hence, it takes the sparsity of the data into
account.


Root Mean Square Propagation is the part of Adam that scales the update of
each parameter according to the magnitude of its recent gradients. The
particularity of the Adam algorithm is that, in addition to the first moment,
which is the average of the gradient used in gradient descent with the
momentum hyper-parameter, it uses the uncentred second moment, or uncentred
variance. Uncentred means that the mean is not subtracted when computing the
second moment.


The Adam algorithm uses the gradient and the squared gradient to compute two
exponentially decaying averages, each as a function of its value at the
previous iteration, as follows:

$m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot DL$; $v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot |DL|^2$

where $m_t$ is an estimate of the mean of the gradient and $v_t$ is an
estimate of the uncentred variance of the gradient. The hyper-parameters
$\beta_1$ and $\beta_2$ are the decay rates of these two estimates and $DL$
is the gradient. The two estimates of the mean and the uncentred variance are
initialized to zero. To update the parameters, the authors apply the
following equations:

$W_{t+1} = W_t - [\alpha \cdot M_t / (\sqrt{V_t} + \epsilon)]$ and the bias as: $b_{t+1} = b_t - [\alpha \cdot M_t / (\sqrt{V_t} + \epsilon)]$

where $M_t = m_t / (1 - \beta_1^t)$ and $V_t = v_t / (1 - \beta_2^t)$. The
quantities $M_t$ and $V_t$ are bias-corrected versions of the two estimates
$m_t$ and $v_t$ of the mean and the variance of the gradient. The default
values recommended by the authors are 0.9 for $\beta_1$ and 0.999 for
$\beta_2$.


Regarding the parameter $\epsilon$, the recommended default value is
$10^{-8}$. As with the gradient descent algorithm, the Adam algorithm has
some variants that address some of its issues, such as the Adamax and Nadam
algorithms. We will go into the details of these algorithms in the next
section.
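Before looking at the variants, the Adam update equations of this section can be sketched for a single parameter as follows. The function name adam and the test function are assumptions made for the illustration; the default hyper-parameter values in the signature are the commonly quoted ones, and the example call uses a larger learning rate only so that the toy problem converges within a few hundred steps.

```python
import math


def adam(gradient, w, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Adam for one parameter, with bias-corrected moment estimates."""
    m = 0.0  # estimate of the mean of the gradient (first moment)
    v = 0.0  # estimate of the uncentred variance (second moment)
    for t in range(1, steps + 1):
        g = gradient(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Bias correction: both estimates start at zero.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return w


# Minimize (w - 3)**2, whose gradient is 2 * (w - 3).
grad = lambda w: 2 * (w - 3)
w_min = adam(grad, w=0.0, alpha=0.1, steps=500)
first_step = adam(grad, w=0.0, steps=1)
print(round(w_min, 3), first_step)
```

Note that the very first update has a size close to the learning rate itself, whatever the scale of the gradient, because the gradient is divided by the square root of its own square.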


4.4 Adamax and Nadam 


The Adam algorithm uses the $L^2$ norm of the previous gradients to update
the parameters. Adamax simply uses the $L^\infty$ norm instead of the $L^2$
norm. This norm tends to provide more stable results.


So, the estimate of the variance becomes:

$v_t = \beta_2^\infty \cdot v_{t-1} + (1 - \beta_2^\infty) \cdot |DL|^\infty = \max(\beta_2 \cdot v_{t-1}, |DL|)$

instead of $v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot |DL|^2$. The
parameters are now updated as: $W_{t+1} = W_t - [\alpha \cdot M_t / v_t]$ and
the bias as: $b_{t+1} = b_t - [\alpha \cdot M_t / v_t]$. Here $v_t$ is based
on the maximum function, so it is not subject to bias toward 0. Therefore, a
bias correction is not required for $v_t$. The recommended values of the
hyper-parameters $\beta_1$ and $\beta_2$ are 0.9 and 0.999 respectively, and
0.002 for $\alpha$.
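The Adamax variant can be sketched in the same way as Adam. The only changes are that the second estimate is a running maximum of the gradient magnitude, and that only the first estimate needs a bias correction. As before, the function name adamax and the test function are assumptions made for the illustration.

```python
def adamax(gradient, w, alpha=0.002, beta1=0.9, beta2=0.999, steps=1000):
    """Adamax for one parameter: Adam with the infinity norm."""
    m = 0.0  # estimate of the mean of the gradient
    u = 0.0  # decayed running maximum of the gradient magnitude
    for t in range(1, steps + 1):
        g = gradient(w)
        m = beta1 * m + (1 - beta1) * g
        u = max(beta2 * u, abs(g))
        if u == 0.0:
            break  # the gradient has always been zero: nothing to update
        # Only m is bias-corrected; u is not biased toward zero.
        w -= (alpha / (1 - beta1 ** t)) * m / u
    return w


# Minimize (w - 3)**2, whose gradient is 2 * (w - 3).
grad = lambda w: 2 * (w - 3)
w_min = adamax(grad, w=0.0, alpha=0.1, steps=500)
first_step = adamax(grad, w=0.0, steps=1)
print(round(w_min, 3), first_step)
```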


Nadam is another variant of the Adam algorithm that combines the Adam
algorithm with the Nesterov accelerated gradient. The Nesterov accelerated
gradient is a variant of gradient descent with momentum.


This algorithm computes the derivative with respect to the approximate
following step instead of the current step. So, the updated value of a
parameter $\theta$ is as follows:

$\theta = \theta - v_t$ where $v_t = \delta \cdot v_{t-1} + \alpha \cdot DL(\theta - \delta \cdot v_{t-1})$

and $DL(\theta - \delta \cdot v_{t-1})$ is the gradient evaluated at the
approximate following step. The Nesterov accelerated gradient is an
improvement of the momentum. So, the Nadam algorithm uses the concept of the
Nesterov accelerated gradient within the Adam paradigm. Nadam is mainly used
when the derivative is noisy or has a high deviation. In the Nadam algorithm,
the learning process is accelerated by combining the exponentially decaying
average of the previous derivatives with the current derivative.
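The Nesterov accelerated gradient update described in this section can be sketched for a single parameter as follows. This sketch covers only the Nesterov step, not the full Nadam algorithm. The function name nesterov_descent, the hyper-parameter values and the test function are assumptions made for the illustration.

```python
def nesterov_descent(gradient, theta, alpha=0.1, delta=0.9, steps=300):
    """Momentum where the gradient is evaluated at the look-ahead position."""
    v = 0.0  # previous update
    for _ in range(steps):
        # Approximate position of the parameter after the momentum step.
        lookahead = theta - delta * v
        v = delta * v + alpha * gradient(lookahead)
        theta -= v
    return theta


# Minimize (theta - 3)**2, whose gradient is 2 * (theta - 3).
theta_min = nesterov_descent(lambda theta: 2 * (theta - 3), theta=0.0)
print(round(theta_min, 6))
```

The only difference from plain momentum is where the gradient is evaluated: at the look-ahead position rather than at the current value of the parameter.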


4.5 How to choose an algorithm 


In this chapter, we have seen that there are different variants of the
gradient descent algorithm. So now you may be wondering how to choose which
variant to implement. One rule that can be applied is that if the data
present some sparsity, it is better to use an adaptive learning rate
algorithm, whether it is the Adam algorithm or gradient descent with an
extension such as the Adaptive gradient algorithm.


The advantage of these algorithms is that they don’t require adjusting the
learning rate manually. Also, with these algorithms it is highly likely that
good results will be reached with just the default values of the
hyper-parameters. Despite the computational efficiency of the Adam algorithm,
some studies have shown that it is not suited to some domains and may not
converge to optimal results.


In general, research has so far shown that models that solve problems like
image recognition or classification are better trained using the gradient
descent algorithm with momentum. However, if a problem is complex and is
solved by a deep neural network, it is best to use an adaptive learning rate
algorithm like Adam or the Adaptive gradient algorithm.


Conclusion 


Thank you for making it through to the end of Machine learning for
beginners: step-by-step guide to learning and mastering machine learning for
absolute beginners. Let’s hope it was informative and able to provide you
with all of the tools you need to achieve your goals, whatever they may be.


The objective of this book is to provide a starting guide for absolute
beginners who have no prior knowledge of machine learning and want an
introduction to the general concepts of machine learning and artificial
neural networks. This book does not require any prerequisite skills to
understand and follow the concepts presented.


The book covers the concept of machine learning and the types of machine
learning, namely supervised, unsupervised, semi-supervised and reinforcement
learning. It also covers the basics of artificial neural networks, activation
functions and the types of artificial neural networks, including the
perceptron, feedforward neural networks, recurrent neural networks,
convolutional neural networks and modular neural networks.


The book presents how artificial neural networks are trained, as well as the
loss functions that can be used depending on the problem being solved by the
artificial neural network. Finally, the book presents the most widely used
algorithm for training artificial neural networks, as well as any machine
learning model, namely gradient descent, and its variants: stochastic
gradient descent, batch gradient descent and mini-batch gradient descent.


Machine learning and artificial neural networks are an active research field.
Research is focused on improving the machine learning paradigms, the
structure of artificial neural networks, and optimal methods for training
machine learning models and artificial neural networks.


The reason is that machine learning is something of a black art, with no
clear guidance on how to apply it. Applying a machine learning model or an
artificial neural network requires a trial-and-error procedure in order to
find the optimal settings. The success of implementing either a machine
learning model or an artificial neural network depends on the success of the
trial-and-error procedure as well as on the experience of the modeler in
making decisions regarding the model settings.


Once you have acquired the knowledge and the skills presented in this book,
you should be able to make a clear judgment on which strategy and model type
you should choose to solve a specific problem with machine learning.


Introduction





Congratulations on downloading Machine Learning With Python and thank 
you for doing so. 


This book is intended as an introduction to machine learning with Python
programming for absolute beginners who have no background in programming. It
provides the basics of both machine learning and Python programming. It also
provides a guide to using Python libraries to build machine learning models.


This book has four main chapters that will help you understand machine
learning models and how to implement them in a Python environment. Chapters 1
and 2 are dedicated to machine learning principles, and the third chapter to
Python programming and its basic syntax. Finally, the fourth chapter provides
applications of machine learning with Python.


The first chapter discusses the fundamentals and concepts of machine
learning. You will learn the different machine learning paradigms, namely the
supervised, unsupervised and reinforcement paradigms. Basically, this chapter
helps you understand the different paradigms of machine learning and the
cases in which each paradigm is applied. This chapter also explains some
widely used algorithms in machine learning and the process of creating a
machine learning model.


In the second chapter, we will discuss artificial neural networks, a tool
widely used in machine learning. Because artificial neural networks are a
branch of machine learning in their own right, they are best covered in a
separate chapter. You will learn the principle of neural networks, the types
of neural networks, and how to train these neural networks. This chapter
explains in detail the different components of an artificial neural network.


In the third chapter, you will tackle Python programming. You will learn why
Python is useful for developing machine learning models. You will also learn
how to get started with Python, how to run Python programs, and the basic
syntax of Python programming. You will also explore some useful platforms
that use Python, if you prefer using a graphical user interface instead of a
command line.


In chapter four, you will learn how to apply machine learning with Python.
This chapter contains some machine learning applications that were discussed
in the previous chapters. You will work through detailed examples in order to
run machine learning models using the Python tools you learned in chapter 3.
The examples cover the paradigms of machine learning and the steps to develop
a multilayer neural network, both without relying on pre-coded functions and
using the built-in functions of Python libraries. Through these examples, you
will apply Python skills to process, analyze and visualize the dataset.


Several books discussing this particular subject exist on the market. Thanks
for deciding to read this book. A lot of effort was made to ensure the book
covers the most relevant information for a beginner. Enjoy reading and
learning!


Chapter 1: The Concept of Machine Learning 


Derived from Artificial Intelligence, machine learning is a complete area of
study. It focuses on developing automated programs that acquire knowledge
from data in order to make a diagnosis or a prediction. Machine learning
relies on the concept that machines are able to learn, identify trends, and
provide decisions with minimal human intervention. These machines improve
with experience and build the principles that govern the learning processes.


There is a wide variety of uses for machine learning, such as marketing,
speech and image recognition, smart robots, web search, etc. Machine learning
requires a large dataset in order to train the model and get accurate
predictions. Different types of machine learning exist and can typically be
classified into two separate categories: supervised and unsupervised
learning. Other algorithms can be labeled as semi-supervised learning or
reinforcement learning.


Supervised learning is mainly used to learn from a categorized or labeled
dataset and is then applied to predict the labels of an unknown dataset. In
contrast, unsupervised learning is used when the training dataset is neither
categorized nor labeled, and is applied to study how a function can describe
a hidden pattern in the dataset. Semi-supervised learning uses categorized
and non-categorized data together. Reinforcement learning is a method that
relies on a trial-and-error process to identify the most accurate behavior in
an environment in order to improve performance.


This chapter will go through the details of each type of machine learning and
explain in depth the differences between the types of learning and their pros
and cons. We will start with supervised learning, which is the most commonly
used and the simplest learning paradigm in machine learning. But before we
dive into the details of machine learning: when is machine learning the best
approach to solve a problem?


When is it best to use machine learning?


It is crucial to understand that machine learning is not the go-to approach
for every problem at hand. Some problems can be solved with robust approaches
that do not rely on machine learning. This is the case for problems with
little data whose target value can easily be defined by a deterministic
approach. When it is easy to determine and program a rule that drives the
target value, machine learning is not the best approach to follow.

Machine learning is best used when it is impossible to develop and code a
rule to define the target value. Image and speech recognition are perfect
examples. Images, for example, have so many features and pixels that a task
that is simple for a human becomes very hard to implement. A human being can
visually recognize an image and classify it, but developing an algorithm with
a rule-based approach is exhausting and not very effective for image
recognition. In this case, building an image dataset, flagging each image
with its specific contents (i.e., animal, flower, object, etc.) and using a
machine learning algorithm to detect each category of images is very
efficient. In short, machine learning is very handy when a number of factors
with little correlation impact the target value.


Machine learning is also the best approach for automating a task over large
datasets. For example, it is easy to detect a spam email or a fraudulent
transaction manually. However, it is very time consuming and tedious to do
the same task for a hundred million emails or transactions. Machine learning
is very cost effective and computationally efficient for handling large
datasets and large-scale problems.


Machine learning is also best used in cases where human expertise to solve a
problem is very limited. An example of these problems is when it is
impossible to label or categorize the data. Machine learning in this
situation is used to learn from the datasets and provide answers for the
problems or the questions we are trying to solve.


Overall, machine learning is best used to solve a problem when: 1) humans
have the expertise to solve the problem, but it is almost impossible to
write a program that mimics the human task, 2) humans have no expertise or
idea regarding the target value (i.e., no labeled or classified data), 3)
humans have the expertise and know the possible target values, but applying
that expertise manually is too costly and time consuming. In general,
machine learning is best used to solve complex data-driven problems such as
learning behaviors for client targeting or acquisition, fraud analysis,
anomaly detection in large systems, disease diagnosis, shape/image
recognition, and speech recognition, among others. For problems where
little data is available and human expertise can easily be programmed as
rules, it is best to use a deterministic rule-based method.


A large dataset should be available for machine learning to be efficient
and effective. Otherwise, issues of generalization and overfitting arise.
Generalization is the ability of a model to be applied to scenarios similar
to those that served to build the model. When machine learning models are
built on a small dataset, they perform poorly on new datasets that they
have not been exposed to. Hence, their applicability becomes very limited.
For example, suppose we build a model that recognizes an image as a cat or
dog image and then apply it to new images of other animals. The model will
classify the new images inaccurately, labeling each of the other animals as
a dog or a cat. Overfitting is when the model shows high accuracy on the
training data, but its accuracy drops drastically when applied to test data
similar to the training data.


Another issue that should be considered when developing a machine learning
model is similar inputs that are associated with several different outputs.
It becomes very difficult to apply a classification model in this case,
because similar inputs yield different outputs. One should keep in mind
that not only the quantity but also the quality of the data affects the
accuracy and applicability of any machine learning approach. If the right
data is not available, collecting it is the first step to take in order to
adopt a machine learning approach.


Now you have learned when it is useful to adopt a machine learning
approach, when you should avoid machine learning, and when a simple
rule-based deterministic approach is the simplest way to solve a problem.
Next, you will learn the different types of machine learning, when each
type is applied, the data that it requires, widely used algorithms, and the
steps to follow to solve a problem with machine learning.


What is supervised learning? 


In supervised learning, we typically have a training dataset with
corresponding labels. From the relationship that associates the training
set and the labels, we try to label new, unknown data. To do so, the
learning algorithm is supplied with the training set and the corresponding
correct labels. Then, it learns the relationship between the training set
and the labels. That relationship is then applied by the algorithm to label
the unknown data. Formally, we want to build a model that estimates a
function f that relates a dataset X (i.e., input) to labels Y (i.e.,
output): $Y = f(X)$. The mathematical relationship f is called the mapping
function.


Let’s consider that we have a collection of images and try to label each
one as a cat image or not a cat image. We first provide the learning
algorithm with the images (X) and the labels of these images (cat or not
cat) as input. Then, we approximate the relationship f that estimates Y
from X as accurately as possible: $Y = f(X) + \varepsilon$, where
$\varepsilon$ is a random error with mean zero. Note that we are
approximating the relationship between the dataset and the labels, and we
want the error $\varepsilon$ to be as close as possible to 0. When
$\varepsilon$ is exactly 0, the model is perfect and 100% accurate; such a
model is very rare in practice.


Typically, a subset of the available labeled data, often 80%, is used as a
training set to estimate the mapping function and build the model. The
remaining 20% of the labeled data is used to assess the model’s accuracy.
At this step, the model is fed with the held-out 20% of the data, and the
predicted output is compared to the actual labels to compute the model
performance.
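The 80/20 split described above can be sketched in a few lines of Python. This is only a minimal illustration: the function names, the toy data, and the fixed threshold rule standing in for a trained model are all assumptions made for the example, not part of any specific library.

```python
import random


def train_test_split(samples, labels, test_fraction=0.2, seed=0):
    """Shuffle the labeled data and split it into training and test parts."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(samples) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [samples[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    x_test = [samples[i] for i in test_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, y_train, x_test, y_test


def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Toy data (assumed): one numeric feature, label 1 when the feature is >= 50.
X = list(range(100))
y = [1 if x >= 50 else 0 for x in X]
X_train, y_train, X_test, y_test = train_test_split(X, y)

# Hypothetical stand-in for a trained model: a fixed threshold rule.
predictions = [1 if x >= 50 else 0 for x in X_test]
test_accuracy = accuracy(predictions, y_test)
print(len(X_train), len(X_test), test_accuracy)
```

The key point is that the 20% test portion is never shown to the model during training, so the accuracy computed on it estimates how the model behaves on unseen data.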


Supervised learning has two main functions, namely classification and
regression. Classification is used when the output Y is qualitative or
categorical data (i.e., a discrete variable), whereas regression is used
when the output Y is quantitative data (i.e., continuous numerical values).
Classification aims at predicting a label, or assigning data to the class
to which they are most similar. In binary classification, the output Y is a
binary variable (i.e., 0 or 1). The example given above, labeling images as
cat or not cat, is an example of binary classification. The model can also
perform multi-class classification, where it predicts one of several
classes. For example, Outlook classifies emails into more than two
categories, such as Focused, Other, and Spam. A number of algorithms can be
used for classification, such as logistic regression, decision trees,
random forests, and multilayer perceptrons. Regression is used when we want
to predict a value such as a house price, a person’s height, or weight.
Linear regression is the simplest model for this type of problem.

The disadvantage of supervised learning models is that they cannot handle
new kinds of information, and training should be repeated when new
information is available. For instance, suppose we have a set of training
images of dogs and cats, and the model is trained to label images as dog
images or cat images. In other words, we have developed a model with two
categories, dogs and cats. When this model is presented with new images of
other animals, for example a tiger, it incorrectly labels the tiger image
as a dog or cat image. The model does not recognize the tiger, but it still
assigns the image to one of its categories. Therefore, the model should be
retrained whenever new information is available.


In the next sub-sections, we cover the most commonly used algorithms to
solve regression and classification problems.


Linear Regression is a model typically used, as mentioned before, for
supervised learning with a regression-type problem where the output is a
continuous variable. Linear regression describes the relationship between
one or multiple independent variables X and a target output variable Y:
Y = WX + B.


If X is a single variable, then W and B are constant parameters to be
determined. Otherwise, if we are dealing with multiple input variables, W
is a matrix and B is a vector to be determined. W is called the
coefficients or weights of X, and B is called the intercept or bias.

In the training process, we try to find the parameters W and B that
provide predicted values of Y that are as close as possible to the actual
values of Y. Then, the optimal W and B are used to predict Y values from
the input X. In this case, how do we know what values are optimal for B
and W?


To identify the best parameters, i.e., weight W and bias B, we use what is
called a cost or loss function J. A cost function quantifies how close the
predicted value is to the target value of Y. Basically, this function
quantifies the model error between the estimated and target values of Y.
This function is to be minimized by the learning algorithm. Mathematically,
the cost function is:


$$J(W, B) = \frac{1}{m} \sum_{i=1}^{m} \left( Y_{\text{predicted},i} - Y_{\text{actual},i} \right)^2$$


where m is the number of training examples. The cost function here is the
Mean Squared Error (MSE) between the model’s predicted value and the actual
target value of Y; its square root is the Root Mean Squared Error,
abbreviated RMSE, and minimizing either one yields the same W and B. Here,
we are using the least squares technique to approximate the optimal values
of W and B.


We can use an optimization algorithm like the Gradient Descent algorithm,
which updates the intercept B and coefficient W values in such a way that
it minimizes the value of the cost function. This algorithm starts with
random values of W and B and updates these values at each iteration until
it reaches a minimal value of the cost function J, as follows:

$$W = W - \alpha \frac{\partial J}{\partial W}$$

$$B = B - \alpha \frac{\partial J}{\partial B}$$

where $\alpha$ is the learning rate parameter, and $\partial J / \partial W$
and $\partial J / \partial B$ are the partial derivatives of the cost
function J with respect to W and B. We will learn the importance of the
gradient of the cost function in the neural network section. The learning
rate $\alpha$ determines how fast the algorithm learns. If the learning
rate has a low value, the weights and the bias are updated in small steps,
and the algorithm converges very slowly. On the contrary, if the learning
rate has a high value, the parameters are updated in larger steps and the
algorithm can converge faster, but a value that is too high may cause the
updates to overshoot the minimum and prevent convergence.
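To make the update rule more concrete, here is a minimal sketch of gradient descent for a linear regression with a single input variable. It assumes the mean squared error as the cost function; the toy data, the learning rate, the number of iterations, and the choice to start from zero rather than random values (for reproducibility) are all assumptions made for the example.

```python
def mean_squared_error(w, b, xs, ys):
    """Cost J: average squared difference between predictions and targets."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient_descent(xs, ys, learning_rate=0.05, iterations=2000):
    """Fit Y = W*X + B by repeatedly stepping against the gradient of J."""
    w, b = 0.0, 0.0  # assumed starting point (the text suggests random values)
    m = len(xs)
    for _ in range(iterations):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = (2 / m) * sum(e * x for e, x in zip(errors, xs))  # dJ/dW
        grad_b = (2 / m) * sum(errors)                             # dJ/dB
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (assumed for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = gradient_descent(xs, ys)
print(round(w, 4), round(b, 4), mean_squared_error(w, b, xs, ys))
```

With this data, the fitted parameters approach W = 2 and B = 1. Trying a much larger learning rate on the same data is a simple way to observe the updates diverging instead of converging.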


Here we presented a simple example of a linear regression model.
Depending on the data, the relationship may not be linear. In this case, a
non-linear regression model should be considered, using the same principle.


Logistic regression is a classification method used when the output is a
discrete or binary variable (i.e., 0 or 1 / yes or no). Logistic regression
follows the same concept as linear regression, with the exception that it
passes the linear combination of the inputs through a logistic function
(also called a sigmoid function) and is based on probabilities. The
logistic function is mathematically formulated as the equation given below:

F(X)=L/(1+exp(-X))


The Sigmoid function is a logistic function where L is equal to 1. This
function takes values between 0 and 1. It converts any value into the range
between 0 and 1, which is very useful in machine learning to map an input
to the probability that it belongs to a specific class. Let’s consider the
example of classifying images as cat images or non-cat images. Basically,
when using a Sigmoid function, we set a threshold value (0.5 is a common
choice) to discriminate between a cat image and a non-cat image. The image
is classified as a cat image if the predicted value is greater than the
fixed threshold, and as a non-cat image otherwise.


In order to get values between 0 and 1, we first compute the following
linear transformation of the input X:

$$Z = B_0 + B_1 X$$

The sigmoid function is then applied to the new variable Z, which gives the
predicted probability h(Z) = 1/(1+exp(-Z)). The cost function for a
logistic regression is then given as follows:
Cost function = -log{1/[1+exp(-Z)]} if the target value y is equal to 1
Cost function = -log{1-1/[1+exp(-Z)]} if the target value y is equal to 0
We can combine the two equations into one compact equation as follows:

$$\text{Cost function} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(h(Z_i)) + (1 - y_i) \log(1 - h(Z_i)) \right]$$

where $h(Z) = 1/(1+\exp(-Z))$ and n is the number of training examples.


Like linear regression, an optimization algorithm like gradient descent
can be applied to minimize the cost function.
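The pieces of logistic regression described above (the linear transformation Z, the sigmoid function, the cost function, and gradient descent) can be put together in a short sketch. The toy data, the learning rate, and the number of iterations are assumptions chosen for illustration, and the names b0 and b1 simply stand for the intercept and the coefficient of X.

```python
import math


def sigmoid(z):
    """Logistic function with L = 1: maps any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def cost(b0, b1, xs, ys):
    """Average cost: -(1/n) * sum[y*log(h) + (1-y)*log(1-h)], h = sigmoid(Z)."""
    total = 0.0
    for x, y in zip(xs, ys):
        h = sigmoid(b0 + b1 * x)
        total += y * math.log(h) + (1 - y) * math.log(1 - h)
    return -total / len(xs)


def fit(xs, ys, learning_rate=0.1, iterations=5000):
    """Minimize the cost with gradient descent, starting from zero."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        errors = [sigmoid(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        b0 -= learning_rate * sum(errors) / n
        b1 -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
    return b0, b1


# Toy data (assumed): larger x values are more likely to belong to class 1.
xs = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit(xs, ys)
# Classify with a 0.5 threshold on the predicted probability.
predictions = [1 if sigmoid(b0 + b1 * x) > 0.5 else 0 for x in xs]
print(predictions, cost(b0, b1, xs, ys))
```

The two classes overlap in this toy dataset on purpose, so the model cannot classify every point correctly and the cost stays above zero.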


What is unsupervised learning? 


The unsupervised learning paradigm is applied to uncover the hidden
structure and patterns of the data in order to learn more from the data. In
contrast with supervised learning, there is neither a correct answer nor a
labeled training dataset to learn from. In this case, only the input
dataset X is available. The algorithms are used to identify the patterns in
the data without explicitly providing them with any information other than
the data itself. In short, no guidance is provided to the algorithm to
learn from the data. The algorithm relies on similarities, patterns, and
differences to learn from the data. There are different types of
unsupervised learning, namely clustering, association, and anomaly
detection.


Clustering algorithms are applied to identify the inherent groupings in
data, for instance, grouping patients by eating habits. Basically,
clustering organizes the unlabeled data into clusters. A cluster is a
collection of data points that are similar. Association is a learning
method that identifies rules. These rules show trends in the dataset; for
example, patients with diabetes tend to have hypertension. Association
describes patterns that occur together in a dataset. Anomaly detection, as
its name suggests, detects unusual data points within the dataset. This
type of learning is practical for detecting suspicious transactions or
emails, fraud, or errors within a system, among others. Overall,
unsupervised learning is very useful for applications where the expected
output is completely unknown. This paradigm of learning provides valuable
information about the data structure and patterns with no labels supplied.
In this section, we will cover the major unsupervised learning algorithms.


Clustering models typically rely on similarity measures defined as a
probabilistic or Euclidean distance. Different algorithms are available for
clustering, such as K-means clustering, hidden Markov models,
self-organizing maps, Gaussian mixture models, and hierarchical clustering,
among others.

A widely used algorithm for unsupervised learning is K-means clustering. It
belongs to the partitional category of algorithms, which define all
clusters at once. The k-means algorithm groups similar data into a fixed
number k of clusters according to their centroids (i.e., centers). The
number of clusters k is a user-fixed parameter. The k-means algorithm
determines the centroid of each of the k clusters and then associates each
data point with the nearest centroid using the similarity measure. More
precisely, the algorithm starts by forming k initial centroids from a
random selection of data points. Every point of the dataset is allocated to
the nearest centroid. Then, the centroid of each cluster is recalculated as
the mean of the points allocated to it. This procedure is repeated until a
convergence criterion is satisfied. The convergence criterion can be a very
small change in the centroids (i.e., the average of the data forming each
cluster), a minimal change in the allocation of data to the clusters, or
reaching the maximum number of iterations.
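The k-means procedure described above can be sketched as follows. This is a minimal version written for illustration only: it uses the squared Euclidean distance as the similarity measure, stops when the allocation of points no longer changes or when an assumed maximum number of iterations is reached, and runs on a small made-up dataset of two-dimensional points.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iterations=100, seed=0):
    """Group points into k clusters; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k initial centroids picked from the data
    assignments = None
    for _ in range(max_iterations):
        # Allocate every point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:  # no re-allocation: converged
            break
        assignments = new_assignments
        # Recalculate each centroid as the mean of its allocated points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignments


# Toy data (assumed): two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, k=2)
print(centroids, assignments)
```

The cluster numbers themselves (0 or 1) carry no meaning; only the grouping matters, and a different random seed may swap the two numbers.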


This algorithm is mainly used because it is simple to understand, and its
implementation is very straightforward. It is also computationally
efficient. However, there are some downsides to this algorithm. It is only
applicable if the mean is defined. In other words, it is applicable only to
continuous data. If the data is binary or categorical, another function
representing the center should be used, as in the k-modes algorithm, which
uses the most frequent values. The k-means algorithm is also sensitive to
small changes in the data and to outliers, because it relies on the mean to
define the centroid. A slight change in the data, or the presence of an
outlier, can lead to large variations in the resulting clusters. Another
issue with the k-means algorithm is that it can be ineffective when
clusters do not have a rounded shape, because it uses the distance to the
centroid of a cluster to assign each point of the data to a cluster. This
algorithm also makes hard assignments, in the sense that each point fits in
only one cluster. In reality, clusters may overlap, and the k-means
algorithm is not able to define overlapping clusters.


Hierarchical clustering, unlike k-means clustering, builds clusters from
previously defined clusters. Depending on how the algorithms define these
clusters, two varieties of hierarchical clustering algorithms can be
identified: agglomerative and divisive algorithms. Agglomerative
algorithms typically start with each data point as an individual cluster.
Then, they iteratively merge similar clusters into larger clusters until
one cluster is formed. In contrast, divisive algorithms start with all data
points in a single cluster, called the root cluster, and then iteratively
divide it into smaller clusters called child clusters. The divisive
algorithms stop when every data point forms a separate cluster, called a
singleton cluster. Hierarchical clustering suffers from the same drawback
as k-means clustering, which is its inability to define overlapping
clusters.


Gaussian mixture models are a probabilistic approach that overcomes some of
the issues faced with the previous clustering algorithms. With Gaussian
mixture models, every cluster is described by its centroid (i.e., mean),
its covariance, and its size or weight. Basically, Gaussian mixture models
categorize data based on their distribution.


Self-organizing maps are a very different approach from the approaches
described previously. Self-organizing maps are a category of artificial
neural network that aims at reducing the dimensionality of the input space.
A self-organizing map is trained to provide a discretized representation of
the input space, using competitive learning and a neighborhood function to
preserve the structure of the data fed to the model. This category of
artificial neural network is completely different from the one covered in
the next chapter. The latter uses an error-correction learning paradigm,
such as backpropagation with gradient descent, to improve its predictions.


Association, another unsupervised learning technique, aims at identifying
data features that frequently occur together and data features that are
correlated. It helps answer the question of what the value of one data
feature can tell about another feature. The directionality of an
association is very important. If a data feature A shows a specific trend
with respect to another feature B, it does not necessarily mean that
feature B will show a trend with respect to feature A. For example, if
clients that buy a product A tend to buy a product B, that does not
necessarily mean that clients that buy product B tend to buy product A. The
association rule algorithm uses different measures to evaluate the
performance of a rule: 1) support, 2) confidence, and 3) lift. In order to
explain these evaluation metrics, let’s take, for example, two features A
and B for which we want to identify the hidden relationship. The support is
the frequency (percentage) of events where A and B occur together, i.e.,
the number of events containing both A and B divided by the total number of
events. The confidence of the rule "A implies B" is the number of events
containing both A and B divided by the number of events containing A. The
lift is the confidence divided by the support of B alone, i.e., the
fraction of all events that contain B. If the lift is greater than 1, the
features A and B are positively correlated. If the lift is less than 1, the
features A and B are negatively correlated. If the lift is equal to 1, then
the features A and B are not correlated.
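As a small worked example of these three measures, the sketch below computes support, confidence, and lift on a made-up list of shopping baskets. The function names and the data are assumptions chosen for illustration; each basket is represented as a set of items.

```python
def support(transactions, items):
    """Fraction of transactions that contain all the given items."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, fraction with the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """Confidence of the rule divided by the support of the consequent alone."""
    return confidence(transactions, antecedent, consequent) / support(
        transactions, consequent
    )


# Toy basket data (assumed for illustration).
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
    {"bread", "butter"},
]

print(support(baskets, {"bread", "butter"}))        # both items together
print(confidence(baskets, {"bread"}, {"butter"}))   # rule: bread implies butter
print(confidence(baskets, {"butter"}, {"bread"}))   # rule: butter implies bread
print(lift(baskets, {"bread"}, {"butter"}))
```

In this data, the two confidence values differ (three of the four bread baskets contain butter, while every butter basket contains bread), which illustrates why the direction of a rule matters.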


Overall, association rule algorithms are based on the frequency of
occurrences across a dataset. The purpose is to detect associations that
occur more frequently within the dataset than they would in a random
sample. This approach is widely used in bioinformatics and market basket
analysis. The latter provides retailers with valuable information about
items that clients tend to buy together.


Anomaly detection is also called outlier detection. Outliers are simply the
points or members of a dataset that show a different behavior or pattern
than the majority of the dataset members. These outliers can often be
identified visually by plotting the data. Sometimes the outliers are just
errors in the dataset, for example, an error in measurement or a human
error in typing information into the system. Such outliers can be removed
from the dataset to be analyzed. In other cases, the outliers may contain
valuable information rather than a simple error, for example, fraudulent
transactions, spam emails, or malignant tumors. In this case, we want to
understand the outliers’ behavior in order to prevent them in the future
and to build a more robust model.

Anomalies can be point, contextual, or collective. The first type, the
point anomaly, is simply a point of a dataset that is at a large distance
from the rest of the dataset. The second type is the contextual anomaly,
which is anomalous only in a specific context and is mainly encountered in
time series datasets, for instance, a temperature above 30°C during the
winter season in a Nordic country. The third type, the collective anomaly,
is a group of dataset points that collectively show a different pattern. In
other words, it is the occurrence of these points together that is the
anomaly. Each member of this group, taken by itself, is not necessarily an
anomaly. This type of anomaly is also common in time series datasets.


There are several approaches to anomaly detection. A simple statistical
approach is to identify the points that diverge from the distribution of
the dataset. In machine learning, the k-nearest neighbor algorithm is a
simple procedure employed for anomaly detection. This algorithm is based on
the density of the data: a point that is far from its nearest neighbors
lies in a low-density region and is a likely anomaly. The nearest points
are determined by a distance such as the Euclidean distance or the
Mahalanobis distance. Another procedure employed is the k-means algorithm,
which relies on clusters. Data points that do not belong to the clusters of
similar points are considered anomalies.
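One possible way to turn the k-nearest neighbor idea into an anomaly score is sketched below. For each point, the score is the average Euclidean distance to its k nearest neighbors, and points with an unusually high score are flagged. The rule of flagging scores above three times the median score is an assumption made for this example, as are the toy data and the value of k.

```python
import math
import statistics


def knn_scores(points, k=2):
    """Average Euclidean distance from each point to its k nearest neighbors."""
    scores = []
    for i, p in enumerate(points):
        distances = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        scores.append(sum(distances[:k]) / k)
    return scores


# Toy data (assumed): four nearby points and one point far from the rest.
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.2), (6.0, 6.0)]

scores = knn_scores(points, k=2)
# Assumed rule of thumb: flag scores above three times the median score.
threshold = 3 * statistics.median(scores)
anomalies = [p for p, s in zip(points, scores) if s > threshold]
print(anomalies)
```

A large score means the point sits in a sparse region of the data, which matches the density-based intuition given above.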


What is semi-supervised learning?


This third paradigm of learning is a hybrid approach which uses both
supervised and unsupervised learning techniques. Semi-supervised learning
makes use of unlabeled datasets alongside labeled datasets. Often, labeled
data is not available, or only a small amount is available, which makes the
unsupervised component very useful.


Semi-supervised learning is based on the concept of first training models
on a labeled dataset, and then applying these models to the unlabeled
dataset in order to build other models that learn from the produced
datasets. Pseudo-labeling is a simple approach to semi-supervised
learning. It starts by training a model on the training dataset, which is
the labeled dataset. Next, it uses this model to predict the outputs of the
unlabeled dataset. Then, it merges the labeled training dataset with the
unlabeled dataset, and merges the labels of the training dataset with the
model outputs. The newly formed data is employed to fit the model again.
This way, the model performance is enhanced, and the learning of the data
structure is improved.
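The steps of this approach can be sketched with a deliberately simple model. The classifier used here, which stores the mean of each class and predicts the class with the nearest mean, is only a hypothetical stand-in for whatever model would be used in practice; the data is also made up for the example.

```python
def fit_centroids(xs, ys):
    """Toy classifier: remember the mean feature value of each class."""
    centroids = {}
    for label in set(ys):
        members = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = sum(members) / len(members)
    return centroids


def predict(centroids, xs):
    """Assign each value to the class whose mean is nearest."""
    return [
        min(centroids, key=lambda label: abs(x - centroids[label])) for x in xs
    ]


# Small labeled dataset and larger unlabeled dataset (assumed toy data).
labeled_x = [1.0, 2.0, 8.0, 9.0]
labeled_y = [0, 0, 1, 1]
unlabeled_x = [0.5, 1.5, 2.5, 7.5, 8.5, 9.5, 10.5]

model = fit_centroids(labeled_x, labeled_y)    # 1) train on the labeled data
pseudo_y = predict(model, unlabeled_x)         # 2) predict the unlabeled data
combined_x = labeled_x + unlabeled_x           # 3) merge the two datasets
combined_y = labeled_y + pseudo_y              # 4) merge labels and predictions
model = fit_centroids(combined_x, combined_y)  # 5) fit the model again
print(pseudo_y, model)
```

In practice, it is common to keep only the predictions the model is confident about, since wrong predictions merged at step 4 are treated as if they were true labels.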


What is reinforcement learning? 


Reinforcement learning is based on a trial-and-error process, where the
model learns interactively. Like supervised learning, reinforcement
learning uses a mapping function between the input and output. It uses
feedback from the output to correct the mapping using a reward concept.
Unlike unsupervised learning, where the objective is to find similarities
in the datasets, reinforcement learning has the goal of finding a function
that maximizes a reward. This type of learning is used in developing game
strategies, for example, to define the best actions to win the game.
Reinforcement learning is very handy in solving complex problems that
involve several interdependent actions to achieve a goal. Before we present
some commonly used algorithms in reinforcement learning, let’s first define
the terminology used to describe a reinforcement learning problem.


Reinforcement learning relies on the concepts of environment, agent,
state, action, reward, policy, and value. We are going to explain each of
these reinforcement learning components.


Let’s start by defining the agent, which is the core component of
reinforcement learning. An agent is a component that takes actions.
Basically, it is an algorithm that performs actions or models a certain
concept. It can be, for example, an algorithm that plays a game.


An action is one of several possible moves or activities that the agent can
perform. The set of actions is pre-defined, and the agent chooses among all
these possibilities. Using the game example, the actions can be a list of
directions such as moving right, left, up, or down. They can also include
the speed, fast or slow.


The state is the current situation in which the agent is. It can be a specific 
location or moment that places the agent in a specific relation with other 
elements within its environment. 

The reward is the feedback that determines whether the action taken by the
agent is a success or a failure. In short, the reward is a measure that
evaluates the success or failure of the agent’s actions.

The environment is the world where the agent takes actions. The agent’s
state and action are fed to the environment, which processes them and
evaluates the reward. It then returns the evaluated reward as well as the
next state of the agent.


The policy is the scheme that the agent follows to choose the next action
according to the current state. The policy maps states to actions in order
to select the highest-rewarding actions.


Finally, the value represents the reward that an agent would obtain, over
the long run, by performing an action in a specific state.


Now the question is: how does reinforcement learning use these concepts?
Basically, the agent performs an action in the current state. This action
is evaluated by the environment, which returns a reward and the new state
of the agent. According to this information, the agent takes another
action. The process is repeated in order to maximize the reward. The goal
in reinforcement learning is to find the best sequence of actions that the
agent should take in order to achieve its goal and maximize the reward,
which is conventionally known in optimization as the objective function.


Two types of reinforcement exist, namely positive and negative
reinforcement. Positive reinforcement consists of increasing the frequency
of an action because a reward is received when this action is taken. In
contrast, negative reinforcement consists of increasing the frequency of an
action because a negative outcome is stopped or avoided when this action is
taken.


The most commonly used algorithms are State-Action-Reward-State-Action
(SARSA) and Q-learning. These algorithms are usually combined with a neural
network to enhance their performance. In that case, the neural network
plays the role of the agent.
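To show how the agent, state, action, reward, and value fit together, here is a minimal sketch of tabular Q-learning (without a neural network) on a made-up environment: a corridor of five positions where the agent starts at position 0 and receives a reward of 1 when it reaches position 4. The environment, the reward, and the parameter values (learning rate alpha, discount factor gamma, and exploration rate epsilon) are all assumptions chosen for illustration.

```python
import random

N_STATES = 5          # positions 0..4 in a corridor (assumed toy environment)
ACTIONS = (-1, +1)    # move left or move right
GOAL = N_STATES - 1


def step(state, action):
    """Environment: returns the next state and the reward for an action."""
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn a table of values Q[(state, action)] by trial and error."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore a random action
            else:
                best = max(q[(state, a)] for a in ACTIONS)
                # Exploit the best known action, breaking ties at random.
                action = rng.choice([a for a in ACTIONS if q[(state, a)] == best])
            next_state, reward = step(state, action)
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            # Move the value toward: reward + discounted best future value.
            q[(state, action)] += alpha * (
                reward + gamma * best_next - q[(state, action)]
            )
            state = next_state
    return q


q = q_learning()
# Policy: in each state, choose the action with the highest learned value.
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)}
print(policy)
```

After training, the learned policy is to move right in every position, which is the shortest path to the reward. The values are lower for positions farther from the goal because of the discount factor.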


How to create a model with machine learning? 


As discussed earlier in this section, different types of machine learning
exist. The choice of a specific approach, whether supervised, unsupervised,
or reinforcement learning, is guided by the problem being solved and the
type of data available. Supervised learning comes in handy when you already
have data with known outputs and you try to forecast the outcome for new
data. For instance, you already have a product on the market, you know the
patterns and profiles of your clients, and you try to predict whether new
clients would be interested in that product. In contrast, the unsupervised
learning paradigm is useful when you do not know the desired output for
your data, for example, when marketing a new product. If a combination of
labeled and unlabeled data is available, then it is best to combine both
supervised and unsupervised paradigms through the semi-supervised
approach. Reinforcement learning is best suited to developing computer
games, robotics, automatic learning of treatment policies for healthcare,
or online stock market trading. It is best applied to interactive learning
processes.


The first step in solving a problem with machine learning is getting the
data. At this step, knowledge of the problem should point toward where to
get the data and what to expect regarding the form of the training dataset.
The second step is identifying the features that should be used for
learning. Redundancy in features might be helpful, and most current
algorithms do not require independent features. Feature selection methods
are useful tools to reduce the number of features being used;
alternatively, one can simply use an algorithm that is designed to deal
with many features. This step requires an exploration of the data,
estimating any missing data, and detecting the presence of outliers or
noisy data. The third step is choosing an algorithm according to the
appropriate learning paradigm and the goal of the machine learning model.
Often a particular algorithm would naturally be the best fit for the
problem. However, it is good practice to try several algorithms and select
the best.


The algorithm should be trained on one part of the data and tested against
the other part for performance verification. There are different
approaches to performance testing. One approach, if the dataset is large,
is to split the data into two datasets of equal size. The model can be
trained on 50% of the dataset and tested on the other 50%. Another approach
is to use 10-fold cross-validation, which consists of randomly splitting
the data into 10 data blocks. In turn, the model is trained on 9 data
blocks and tested on the 1 remaining data block. This process is repeated
10 times, so that each block serves once as the test set, and the same
performance testing should be used for each algorithm tested. This might be
very computationally expensive, in particular for large datasets, which is
one of the disadvantages of machine learning. One should consider finding a
balance between simplicity and fitting the data. Another disadvantage is
overfitting, meaning that the model has high performance on the training
dataset and very low performance when tested on an independent dataset.
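The 10-fold cross-validation procedure can be sketched as follows. The helper names are assumptions made for this example, and the "model" is a trivial stand-in that always predicts the most frequent label of its training blocks; any real algorithm could be plugged in at the same place.

```python
import random


def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # random assignment to blocks
    folds = [indices[i::k] for i in range(k)]  # k blocks of near-equal size
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f in range(k) if f != i for j in folds[f]]
        yield train_idx, test_idx


def majority_label(labels):
    """Toy stand-in for training: return the most frequent label."""
    return max(set(labels), key=labels.count)


# Toy labels (assumed): 70 examples of class 0 and 30 examples of class 1.
y = [0] * 70 + [1] * 30

fold_accuracies = []
for train_idx, test_idx in k_fold_splits(len(y), k=10):
    prediction = majority_label([y[i] for i in train_idx])  # train on 9 blocks
    hits = sum(1 for i in test_idx if y[i] == prediction)   # test on 1 block
    fold_accuracies.append(hits / len(test_idx))

mean_accuracy = sum(fold_accuracies) / len(fold_accuracies)
print(fold_accuracies, mean_accuracy)
```

Averaging the accuracy over the 10 blocks gives a more stable performance estimate than a single split, at the cost of training the model 10 times.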


In this chapter, we covered the major machine learning paradigms, namely
supervised, unsupervised, semi-supervised, and finally, reinforcement
learning. We presented some methods that are used for each of these types
of learning. The disadvantage of some of the methods presented, for
example, linear or logistic regression for supervised learning, is that
they are based on a linear relationship between the data features and the
target output. However, in the majority of real-world applications, this
relationship is not linear. Therefore, we need other tools to model the
relationship between the inputs and outputs. A widely used tool is
artificial neural networks. Moreover, these networks are often combined
with traditional methods to improve their performance, as was mentioned for
reinforcement learning. This tool is presented in detail in the following
chapter.


Chapter 2: Artificial Neural Networks 


This chapter discusses the essential aspects of artificial neural networks.
It also covers their components, in particular activation functions, how to
train an artificial neural network, and the different advantages of using
an artificial neural network.


Definition of artificial neural network 


Artificial neural networks, a widely used approach in machine learning,
are inspired by the human brain. The objective of neural networks is to
replicate how the human brain learns. A neural network is composed of an
input layer, an output layer, and a hidden layer that transforms the input
into information useful to the output layer. Usually, several hidden layers
are implemented in an artificial neural network. The figure below presents
an example of a neural network system composed of 2 hidden layers:
Example of an artificial neural network


Before going further and explaining how neural networks work, let’s first
define what a neuron is. A neuron is simply a mathematical equation
expressed as the sum of the weighted inputs. Let’s consider
$X = \{x_1, x_2, \dots, x_N\}$, a vector of N inputs. The neuron is a linear
combination of all inputs, defined as follows:

$F(X) = F(x_1, x_2, \dots, x_N) = w_1 x_1 + w_2 x_2 + \dots + w_N x_N$

where $w_1, w_2, \dots, w_N$ are the weights assigned to each input. The
function F can also be represented as:

$F(X) = WX$

where W is a weight matrix and X is a vector of data. The second formulation
is very convenient when programming a neural network model. The weights are
determined during the training procedure. In fact, training an artificial
neural network means finding the optimal weights W that provide the most
accurate output.
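As a minimal sketch of the two formulations, the weighted sum can be
computed term by term or as a single product. The input and weight values
below are made up for illustration, and NumPy is assumed to be installed:

```python
import numpy as np

# Hypothetical example values: three inputs and their weights.
x = np.array([1.0, 2.0, 3.0])     # input vector X
w = np.array([0.5, -1.0, 2.0])    # weights w_1 ... w_N

# Sum of the weighted inputs, written out term by term.
f_loop = sum(w_i * x_i for w_i, x_i in zip(w, x))

# The same quantity written as the product W X.
f_dot = np.dot(w, x)

print(f_loop, f_dot)
```

Both expressions give the same number; the second form is the one that
scales conveniently to many neurons and many inputs.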


An activation function is applied to each neuron’s weighted sum of the
inputs X. The role of the activation function is to decide whether the
neuron should be activated or not, according to the value it computes. This
process is applied at each layer of the network. In the next sub-sections,
we will discuss in detail the role and types of activation functions as
well as the different types of neural networks.


What is an activation function and its role in neural network models?


Activation functions are mathematical functions and a crucial component of
an artificial neural network model. An activation function is associated
with each neuron and decides whether to activate the neuron or not. For
instance, let’s consider the output from a neuron, which is:

$Y = \sum (weight \times input) + bias$

The output Y can take any value. The neuron itself has no information on
the reasonable range of values that Y can take. For this reason, the
activation function is implemented in the neural network to check the value
of Y and decide whether the neural connections should consider this neuron
activated or not.


There are different types of activation functions. The most intuitive
function is the step function. This function sets a threshold and activates
a neuron only if its output exceeds that threshold. In other words, the
output of this function is 1 if Y is greater than the threshold and 0
otherwise. Formally, the activation function is:

F='activated' or F=1, if Y > threshold

F='not-activated' or F=0, otherwise.


This activation function can be used for a classification problem where the
output should be yes or no (i.e., 1 or 0). However, it has some drawbacks.
For example, let’s consider a set of several categories (i.e., class1,
class2, etc.) to which an input may belong. If this activation function is
used and more than one neuron is activated, the output will be 1 for all of
these neurons. In this case, it is hard to distinguish between the classes
and decide which class the input belongs to, because all neuron outputs are
1. In short, the step function does not support multiple output values or
classification into several classes.
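The following sketch illustrates the step function and the ambiguity
described above. The threshold of 0 and the three neuron outputs are
assumptions chosen only for the illustration:

```python
import numpy as np

def step(y, threshold=0.0):
    """Return 1 where y is greater than the threshold and 0 otherwise."""
    return np.where(np.asarray(y) > threshold, 1, 0)

# Hypothetical outputs Y of three neurons, one neuron per class.
y = np.array([0.2, 3.5, 1.1])
activations = step(y)

# All three neurons are activated, so the classes cannot be told apart.
print(activations)
```

Although the second neuron has a much larger value of Y than the others,
this information is lost once the step function is applied.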


The linear activation function, unlike the step function, provides a range
of activation values. It computes an output that is proportional to the
input. Formally:

F(X)=W*X, where X is the input.

This function supports several output values rather than just 1 or 0.
However, because it is linear, it is poorly suited to training the model
with backpropagation. Backpropagation is the process that relies on the
derivative (i.e., the gradient) of the function to update the parameters, in
particular the weights. The derivative of the linear activation function is
a constant, equal to W, and does not depend on the input X. Therefore, it
provides no information on which weights applied to the input give accurate
predictions.


Moreover, all layers can be reduced to one layer when using the linear
function. Because every layer applies a linear function, the final layer is
itself a linear function of the first layer. So, no matter how many layers
are used in the neural network, they are equivalent to a single layer, and
there is no point in using multiple layers. A neural network with multiple
layers connected with a linear activation function is just a linear
regression model that cannot capture the complexity of the input data.
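A quick numerical check makes this collapse concrete. In the sketch below,
the layer sizes and the random weights are arbitrary assumptions, and biases
are left out for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)    # fixed seed so the run is repeatable
W1 = rng.normal(size=(4, 3))      # hypothetical first-layer weights
W2 = rng.normal(size=(2, 4))      # hypothetical second-layer weights
x = rng.normal(size=3)            # one input vector

two_layers = W2 @ (W1 @ x)        # two stacked layers, linear activation
one_layer = (W2 @ W1) @ x         # a single layer with weights W2 W1

print(np.allclose(two_layers, one_layer))
```

The two results agree, which shows that the stacked linear layers compute
nothing that a single layer with weights W2 W1 could not compute.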


The majority of neural networks use non-linear activation functions because,
in the majority of real-world applications, relations between the output and
the input features are non-linear. Non-linear functions allow the neural
network to map complex patterns between the inputs and the outputs. They
also allow the neural network to learn the complex processes that govern
complex or high-dimensional data such as images and audio. Non-linear
functions overcome the drawbacks of linear functions and step functions.
They support backpropagation (i.e., the derivative is not a constant and
depends on the input) and the stacking of several layers (i.e., a
combination of non-linear functions is non-linear). Several non-linear
functions exist and can be used within a neural network. In this book, we
are going to cover the most commonly used non-linear activation functions in
machine learning applications.


The sigmoid function is one of the most used activation functions within an
artificial neural network. Formally, a sigmoid function is equal to the
inverse of the sum of 1 and the exponential of the negated input:

F(X)=1/(1+exp(-X))

Outputs of a sigmoid function are bounded by 0 and 1. More precisely, the
outputs take any value between 0 and 1 and provide clear predictions. In
fact, when X is greater than 2 or lower than -2, the value of F(X) is close
to the edge of the curve (i.e., close to 1 or 0, respectively).


[Figure: Sigmoid activation function, with F(X) on the vertical axis]


The disadvantage of this activation function, as we can see from the figure
above, is the very small change in the output for input values under -4 and
above 4. This problem is called the ‘vanishing gradient’ problem, which
means that the gradient is very small at the horizontal extremes of the
curve. As a result, a neural network using the sigmoid function learns very
slowly when its neurons approach these edges, and training becomes
computationally expensive.
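To see the vanishing gradient numerically, the sketch below evaluates the
sigmoid and its derivative, which can be written as F(X)*(1-F(X)), at a few
input values. The sample inputs are arbitrary choices for illustration:

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_gradient(x):
    """Derivative of the sigmoid, expressed through the sigmoid itself."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient is largest at 0 and shrinks quickly as x grows.
for x in [0.0, 2.0, 4.0, 8.0]:
    print(x, sigmoid(x), sigmoid_gradient(x))
```

The gradient is 0.25 at X=0 and falls well below 0.001 by X=8, so weight
updates that pass through a saturated neuron become very small.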


The tanh function is another activation function that is similar to the
sigmoid function. The mathematical formulation of this function is:
F(X)=tanh(X)=[2/(1+exp(-2X))]-1.

This function is a scaled sigmoid function. Therefore, it has the same
characteristics as the sigmoid function. However, the outputs of this
function range between -1 and 1, and its gradient is more pronounced than
the gradient of the sigmoid function. Unlike the sigmoid function, the tanh
function is zero-centered, which makes it very useful for inputs with
negative, neutral, and positive values. The drawbacks of this function, as
for the sigmoid function, are the vanishing gradient issue and its
computational cost.
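The relation between tanh and the sigmoid can be checked numerically. The
sketch below compares NumPy's np.tanh with the scaled sigmoid from the
formula above on a few arbitrary sample points:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3.0, 3.0, 7)               # sample inputs from -3 to 3
scaled_sigmoid = 2.0 * sigmoid(2.0 * x) - 1.0

# tanh(X) and 2*sigmoid(2X) - 1 agree up to rounding error.
print(np.allclose(np.tanh(x), scaled_sigmoid))

# tanh is zero-centered: its output at 0 is 0, while the sigmoid gives 0.5.
print(np.tanh(0.0), sigmoid(0.0))
```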


The Rectified Linear Unit function, known as the ReLu function, is also a
widely used activation function. It is computationally efficient and allows
the neural network to converge quickly compared to the sigmoid and tanh
functions, because it uses a simple mathematical formulation. ReLu returns
X as output if X is positive and 0 otherwise. Formally, this activation
function is formulated as
F(X)=max(0,X).


This activation function is not bounded above and takes values from 0 to
+inf. Although it looks like a linear function (i.e., it is equal to the
identity for positive values), the ReLu function is non-linear, and its
derivative is not the same everywhere. The drawback of the ReLu is that the
derivative (i.e., the gradient) is 0 when the inputs are negative. This
means that, as for the linear function, backpropagation carries no
information for these inputs, and the neuron cannot learn unless its inputs
are greater than 0. This aspect of the ReLu, a gradient equal to 0 when the
inputs are negative, is called the dying ReLu problem.
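A short sketch of the ReLu function and its gradient is given below. The
gradient at exactly X=0 is a matter of convention; taking it as 0 is an
assumption made here:

```python
import numpy as np

def relu(x):
    """ReLu activation: max(0, x), applied element by element."""
    return np.maximum(0.0, x)

def relu_gradient(x):
    # 1 for positive inputs, 0 for negative inputs.
    # At exactly 0 the value is set to 0 by convention (an assumption).
    return (np.asarray(x) > 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(x))
print(relu_gradient(x))
```

For the two negative inputs both the output and the gradient are 0, which
is the situation described as the dying ReLu problem.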


To prevent the dying ReLu problem, two ReLu variations can be used, namely
the Leaky ReLu function and the Parametric ReLu function. The Leaky ReLu
function returns as output the maximum of X and 0.1 times X. In other words,
the Leaky ReLu is equal to the identity function when X is greater than 0
and is equal to the product of 0.1 and X when X is less than zero. This
function is given as follows:

F(X)=max(0.1*X, X)

This function has a small positive gradient, equal to 0.1, when X has
negative values, which makes it support backpropagation for negative values.
However, it may not provide consistent predictions for these negative
values.


The parametric ReLu function is similar to the Leaky ReLu function, except
that the gradient used when X is negative is not fixed: it is a parameter a
of the neural network, which can be learned during training. The
mathematical formulation of this function is as follows:

F(X)=max(a*X, X)
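Both variations can be sketched with one helper, since the Leaky ReLu is
the parametric ReLu with the slope fixed at 0.1. The function names and the
slope 0.25 used in the second call are illustrative assumptions, and the
formula with max assumes a slope between 0 and 1:

```python
import numpy as np

def parametric_relu(x, a):
    """max(a*x, x); assumes 0 <= a < 1 so negative inputs are scaled by a."""
    return np.maximum(a * x, x)

def leaky_relu(x):
    """Leaky ReLu as described in the text, with a fixed slope of 0.1."""
    return parametric_relu(x, 0.1)

x = np.array([-10.0, -1.0, 0.0, 3.0])
print(leaky_relu(x))              # negative inputs are multiplied by 0.1
print(parametric_relu(x, 0.25))   # negative inputs are multiplied by 0.25
```

In a real network the slope a would be updated during training together
with the weights; here it is simply passed in as an argument.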


There are other variations of the ReLu function, such as the exponential
linear unit. Unlike the Leaky ReLu and the parametric ReLu, which are linear
for negative values of X, this function follows an exponential curve for
negative values of X. The downside of this function is that it saturates for
large negative values of X. Other variations exist, which all rely on the
same concept of defining a gradient greater than 0 when X has negative
values.


The Softmax function is another type of activation function, used
differently from the ones presented previously. This function is usually
applied only to the output layer, when a classification of the inputs into
several different classes is needed. In fact, the Softmax function supports
several classes and provides the probability that an input belongs to each
class. It takes the exponential of the output for every class and divides it
by the sum of these exponentials, so that each value lies between 0 and 1
and all values sum to 1.
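As an illustration, a minimal Softmax can be written in a few lines. The
three class scores are made-up values, and subtracting the maximum score
before exponentiating is a common numerical safeguard that does not change
the result:

```python
import numpy as np

def softmax(z):
    """Turn a vector of scores into probabilities that sum to 1."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - np.max(z))     # shift by the max for numerical stability
    return e / e.sum()

scores = [2.0, 1.0, 0.1]          # hypothetical outputs for three classes
p = softmax(scores)

print(p)            # one probability per class
print(p.sum())      # the probabilities sum to 1
print(p.argmax())   # index of the most probable class
```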


Given all these activation functions, where each one has its pros and cons,
the question now is: which one should be used in a neural network? The
answer is that a better understanding of the problem at hand will guide the
choice of a specific activation function, especially if the characteristics
of the function being approximated are known beforehand. For instance, a
sigmoid function is a good choice for a classification problem. In case the
nature of the function being approximated is unknown, it is highly
recommended to start with a ReLu function and then try other activation
functions. Overall, the ReLu function works well for a wide range of
applications. This is an area of ongoing research, and you may even try your
own activation function.


An important aspect of choosing an activation function is the sparsity of
the activation. Sparsity means that not all neurons are activated. This is
a desired characteristic in a neural network because it makes the network
learn faster and less prone to overfitting. Let’s imagine a large neural
network with many neurons. If all neurons were activated, all of them would
have to be processed to describe the final output. This makes the neural
network very dense and computationally expensive to process. The sigmoid and
the tanh activation functions have this property of activating almost all
neurons, which makes them computationally inefficient, unlike the ReLu
function and its variations, which deactivate or dampen neurons with
negative inputs. That is the reason why it is recommended to start with the
ReLu function when approximating a function with unknown characteristics.
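Sparsity can be illustrated by counting how many neurons output exactly 0.
In the sketch below, the 1000 random values stand in for the weighted sums
of 1000 hypothetical neurons:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=1000)         # stand-in weighted sums of 1000 neurons

relu_out = np.maximum(0.0, z)
sigmoid_out = 1.0 / (1.0 + np.exp(-z))

# Fraction of neurons whose activation is exactly 0 (inactive neurons).
relu_inactive = np.mean(relu_out == 0.0)
sigmoid_inactive = np.mean(sigmoid_out == 0.0)
print(relu_inactive, sigmoid_inactive)
```

With inputs centered on 0, roughly half of the ReLu outputs are exactly 0,
while none of the sigmoid outputs are.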


What are the types of artificial neural networks? 


Several categories of artificial neural networks with different properties
and complexities exist. The first and simplest neural network developed is
the perceptron. The perceptron computes the weighted sum of the inputs,
applies an activation function, and provides the result to the output layer.


Another old and simple approach is the feedforward neural network. In its
simplest form, this type of artificial neural network has a single hidden
layer. Each layer is fully connected to the following layer, meaning that
each node is attached to every node of the next layer. It propagates the
information in one direction, from the inputs to the outputs through the
hidden layer. This process is known as forward propagation (sometimes
described as a front-propagated wave), and it usually relies on an
activation function that processes the data in each node of the layers. The
network returns the weighted sum of the inputs, transformed by the hidden
layer’s activation function. Feedforward neural networks usually use the
backpropagation method for the training process and the logistic function as
an activation function.


Several other neural networks are derived from this type of network, for
example the radial basis function neural network. This is a feedforward
neural network that depends on the radial basis function instead of the
logistic function. This type of neural network has two layers; in the inner
layer, the features are combined with the radial basis function. The radial
basis function computes the distance of each point to a center. This neural
network is useful for continuous values, to evaluate the distance from the
target value.


In contrast, the logistic function is used for mapping arbitrary values to
binary values (i.e., 0 or 1; yes or no). Deep feedforward neural networks
are multilayer feedforward neural networks. They have become the most
commonly used type of neural network in machine learning as they yield
better results. A new type of learning called deep learning has emerged from
this type of neural network.


Recurrent neural networks are another category that uses a different type
of node. As in a feedforward neural network, each hidden layer passes the
information on to the next layer. However, the outputs of the hidden layers
are also saved and fed back into the network. Basically, the first layer,
which is the input layer, is processed as the sum of the weighted features.
The recurrent process is applied in the hidden layers. At each step, every
node saves information from the previous step. In effect, the network uses
memory while the computation is running. In short, the recurrent neural
network uses forward propagation and backpropagation to learn from the
previous timesteps and improve its predictions. In other words, information
is processed in two directions, unlike in feedforward neural networks.


A multilayer perceptron, or multilayer neural network, is a neural network
that has three or more layers. This category of network is fully connected,
meaning that every node is attached to all nodes in the following layer.


Convolutional neural networks are typically useful for image classification
or recognition. The processing used by this type of artificial neural
network is designed to deal with pixel data. Convolutional neural networks
are multi-layer networks based on convolutions, which apply filters for
neuron activation. Applying the same filter across the input detects the
same feature at every position and results in what is called a feature map.
The feature map reflects the strength and importance of a feature of the
input data.


Modular neural networks are formed from more than one connected neural
network. These networks rely on the concept of ‘divide and conquer.’ They
are handy for very complex problems because they allow combining different
types of neural networks. Therefore, they combine the strengths of different
neural networks to solve a complex problem in which each neural network
handles a specific task.


How to train an artificial neural network? 


As explained at the beginning of this chapter, neural networks compute a
weighted sum of inputs and apply an activation function at each layer. The
final result is then provided to the output layer. This procedure is
commonly called forward propagation. In order to train an artificial neural
network, the weights need to be optimized until they produce the most
accurate outputs. The process of training an artificial neural network is as
follows:

1. Initialize the weights
2. Apply the forward propagation process
3. Evaluate the neural network performance
4. Apply the backward propagation process
5. Update the weights
6. Repeat from step 2 until a maximum number of iterations is reached or
   the neural network performance no longer improves.


As we can see from the steps of training an artificial neural network
presented above, we need a performance measure that describes how accurate
the neural network is. This measure is called the loss function or cost
function. It can be the same as the cost function we presented in the
previous chapter:

$J = \frac{1}{N} \sum_{i=1}^{N} (Y_{predicted,i} - Y_{true,i})^2$

where N is the number of outputs, $Y_{predicted}$ is the output of the
neural network, and $Y_{true}$ is the true value of the output. This
function provides the error of the neural network. Small values of J reflect
a high accuracy of the neural network.
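As a small sketch, and assuming the cost function J is the mean of the
squared differences between the predicted and the true outputs, it can be
computed as follows. The function name cost and the sample values are
illustrative only:

```python
import numpy as np

def cost(y_predicted, y_true):
    """Mean of the squared differences between predictions and true values."""
    y_predicted = np.asarray(y_predicted, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return np.mean((y_predicted - y_true) ** 2)

# Perfect predictions give J = 0.
print(cost([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))

# Errors of 1, 0 and 2 give J = (1 + 0 + 4) / 3.
print(cost([1.0, 2.0, 3.0], [2.0, 2.0, 5.0]))
```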


So far, we have defined the loss function and seen how the neural network
works in general. Now, let’s go into the details of each step of the
training process.


Let’s consider a set of inputs X and outputs Y. We initialize W (i.e., the
weights) and b (i.e., the bias) as null matrices. The next step is to apply
the feedforward propagation, which consists of feeding each layer of the
artificial neural network with the weighted sum of its inputs plus the
bias. Let’s consider that we have two layers. We can calculate the first
hidden layer’s output using the following equation:
Z1=W1*X+b1

where W1 and b1 are the parameters of the neural network, namely the weights
and bias of the first layer, respectively.
Next, we apply the activation function F1, which can be any of the
activation functions presented previously in this chapter:

A1=F1(Z1).
The result is the output of the first layer, which is then fed to the next
layer as:

Z2=W2*A1+b2
where W2 and b2 are the weights and bias of the second layer, respectively.
To this result, we apply an activation function F2:

A2=F2(Z2).
Now A2 is the output of the artificial neural network. The activation
functions F1 and F2 might be the same activation function or different
activation functions, depending on the dataset and the expected output.
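The forward pass through two layers can be sketched as below. The layer
sizes, the random weights, and the choice of ReLu for F1 and sigmoid for F2
are assumptions made for the illustration. Random weights are used rather
than null matrices, because with all-zero weights every neuron of a layer
would compute the same value:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    """Forward propagation through two layers (F1 = ReLu, F2 = sigmoid)."""
    Z1 = W1 @ X + b1      # weighted sum of the first layer
    A1 = relu(Z1)         # output of the first layer
    Z2 = W2 @ A1 + b2     # weighted sum of the second layer
    A2 = sigmoid(Z2)      # output of the network
    return A2

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))       # 3 features, 5 samples stored as columns
W1 = rng.normal(size=(4, 3))      # first layer: 4 neurons
b1 = np.zeros((4, 1))
W2 = rng.normal(size=(1, 4))      # second layer: 1 output neuron
b2 = np.zeros((1, 1))

A2 = forward(X, W1, b1, W2, b2)
print(A2.shape)                   # one output per sample
```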


After the feedforward propagation, we compare the neural network output
against the target output with the loss function. It is highly likely that
the difference between the estimated output and the actual values is very
high at this stage. Therefore, we have to adjust the weights through the
backpropagation process. We calculate the gradient of the loss function J
with respect to the weights and biases of each layer; this gradient depends
on the derivative of each layer’s activation function. We start by
evaluating the derivative at the last layer, then at the layer before it,
and so on until the input layer. Then we update the weights according to
the gradient. Applying these steps to our example of a two-layer neural
network gives:

W2=W2-a*(dJ/dW2)

b2=b2-a*(dJ/db2)

W1=W1-a*(dJ/dW1)

b1=b1-a*(dJ/db1)
The parameter a is the learning rate. This parameter determines the rate at
which the weights are updated. The process that we have just described is
called the gradient descent algorithm. The process is repeated until it
attains a pre-fixed maximum number of iterations. In chapter 4, we will
develop an example to illustrate a perceptron and a multi-layer neural
network by following similar steps using Python. We will develop a
classifier based on an artificial neural network. Now, let’s explore the
pros and cons of using an artificial neural network for machine learning
applications.
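The training steps described in this section can be sketched end to end on
a toy problem. Everything specific below is an assumption made for the
illustration: the data (Y equal to the square of X), the 8 hidden neurons,
tanh as F1, the identity as F2, the mean squared error as the cost J, the
learning rate of 0.1, and the small random initial weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): 1 feature, 20 samples stored as columns.
X = np.linspace(-1.0, 1.0, 20).reshape(1, -1)
Y = X ** 2

# Step 1: initialize the weights (small random values) and the biases.
W1 = rng.normal(scale=0.5, size=(8, 1))
b1 = np.zeros((8, 1))
W2 = rng.normal(scale=0.5, size=(1, 8))
b2 = np.zeros((1, 1))
a = 0.1                  # learning rate
N = X.shape[1]           # number of samples
losses = []

for iteration in range(2000):
    # Step 2: forward propagation (F1 = tanh, F2 = identity).
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)
    A2 = W2 @ A1 + b2

    # Step 3: evaluate the performance with the cost J.
    J = np.mean((A2 - Y) ** 2)
    losses.append(J)

    # Step 4: backward propagation, from the last layer to the first.
    dZ2 = 2.0 * (A2 - Y) / N                  # dJ/dZ2
    dW2 = dZ2 @ A1.T                          # dJ/dW2
    db2 = dZ2.sum(axis=1, keepdims=True)      # dJ/db2
    dZ1 = (W2.T @ dZ2) * (1.0 - A1 ** 2)      # uses the derivative of tanh
    dW1 = dZ1 @ X.T                           # dJ/dW1
    db1 = dZ1.sum(axis=1, keepdims=True)      # dJ/db1

    # Step 5: update the weights and biases with gradient descent.
    W2 -= a * dW2
    b2 -= a * db2
    W1 -= a * dW1
    b1 -= a * db1

# Step 6 is the loop itself: here it stops after a fixed number of iterations.
print(losses[0], losses[-1])
```

The cost printed at the end should be smaller than the cost at the first
iteration, which is the sign that the weights are moving in the right
direction.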


Artificial neural network: pros and cons of use 


Nowadays, artificial neural networks are applied in almost every domain.
Research in the domain of artificial neural networks is very active, and
several neural networks have emerged to take advantage of the full potential
of this artificial intelligence approach. Artificial neural networks have
several advantages.


Artificial neural networks are able to map structures and learn from the
data quickly. They are also able to map the complex structure and
connections that relate the outputs to the input datasets, which is the case
in many real-life applications. Once an artificial neural network is
developed and trained, it can generalize. In other words, it can be applied
to map relationships in data that it has not been exposed to, or to make
predictions for new datasets. Moreover, the artificial neural network does
not make any assumptions about the structure or the distribution of the
input data. It does not impose specific conditions on the data or
assumptions on the relationships in the data, unlike traditional statistical
methods. The fact that artificial neural networks can handle a large amount
of data makes them an appealing tool. Artificial neural networks are a
non-parametric approach, which allows developing a model with less of the
error that is caused by the estimation of parameters. Despite these
appealing characteristics, artificial neural networks suffer from some
drawbacks.


The downside of artificial neural networks is that they often operate as a
black box. This means that we cannot fully understand the relationship
between the inputs and outputs, or the interdependence between specific
input variables and the output. In other words, we cannot detect how much
each input variable impacts the output. The training process can also be
computationally expensive. We can overcome this problem by using parallel
computing and by writing code that takes full advantage of the computing
power available.


Chapter 3: Python for Machine Learning 


In order to use machine learning, we need a programming language to give
instructions to the machine. In this chapter, we are going to learn the
basics of the Python language and how to install and launch Python. We are
also going to learn some Python syntax and some useful tools to run Python.
We also cover some basic Python libraries that are useful for machine
learning. These libraries will be used in the next chapter to develop
machine learning applications. First of all, why would we use Python and not
another programming language?


Why use Python for machine learning? 


Python is a programming language that is extensively used, for many
reasons. One main reason is that it is a free and open source language,
which means it is accessible to everybody. Although it is free, it is a
community-based language, meaning that it is developed and supported by a
community that coordinates its efforts through the internet to improve the
language features. Other reasons people use Python are: 1) quality, as a
readable language with a simple syntax; 2) program portability to any
operating system (e.g., Windows, Unix) with little or no modification;
3) speed of development: Python needs no separate compilation step, so code
can be run as soon as it is written; 4) component integration, which means
that Python can be integrated with other programs, can call C and C++
libraries, and can be called from other programming languages. Python comes
with basic and powerful standard operations as well as advanced pre-coded
libraries like Numpy for numeric programming. Another advantage of Python
is that it manages memory automatically and does not require variable or
size declarations. Moreover, Python allows developing different kinds of
applications, such as Graphical User Interfaces (GUI), numeric programming,
game programming, database programming, internet scripting, and much more.
In this chapter, we will focus on how to do numeric programming for machine
learning applications and how to get started with Python.


How to Get started with Python? 


Python is a scripting language and, like other interpreted languages, it
needs an interpreter. The latter is a program that executes other programs.
As its name indicates, it works as an interpreter between the computer
hardware and the instructions of a Python program. Python comes as a
software package and can be downloaded from Python’s website:
www.python.org. Once Python is installed, the interpreter is usually an
executable program. Note that if you use Unix or Linux, Python might
already be installed, probably in the /usr directory. Now that you have
Python installed, let’s explore how we can run some basic code.


To run Python, you can open your operating system’s prompt (on Windows,
open a Command Prompt window) and type python. If it does not work, it
means that Python is not in the shell’s PATH environment variable. In this
case, you should type the full path of the Python executable. On Windows, it
should be something similar to C:\Python3.7\python, and on Unix or Linux it
is usually installed in a bin folder: /usr/local/bin/python (or
/usr/bin/python).
When you launch Python, it displays two lines of information, the first of
which gives the Python version used, as in the example below:
Python 3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>>


Once a session is launched, Python displays the prompt >>>, which means it
is ready to run the lines of code you type in. The following is an example
of a print statement:

>>> print('Hello World !')
Hello World !
>>>

When running Python in an interactive session as we did, the result is
displayed right below the line typed after >>>, as shown in the example.
The code is executed interactively. To exit the interactive Python session,
type Ctrl-Z followed by Enter on Windows, or Ctrl-D on a Unix/Linux machine.


Now we have learned how to launch Python and run code in an interactive
session. This is a good way to experiment and test code. However, the code
is never saved and needs to be typed again to run the statement again. To
store the code, we need to type it in a file. Files that contain Python
statements are called modules. These files have the extension ‘.py’. A
module can be executed by passing its file name to the python command. A
text editor like Notepad++ can be used to create the module files. For
instance, let's create a module named test.py that prints ‘Hello World!’
and calculates 3**2. The file should contain the following statements:

print('Hello World!')

print('3**2 equal to', 3**2)


To run this module, in the operating system’s prompt, type the following
command line:


python test.py


If this command line does not work, you should type the full path of
Python’s executable and the full path of the test.py file. You can also
change the working directory by typing cd followed by the full path of the
folder that contains test.py, then typing python test.py. Changing the
working directory to the directory where you saved the modules is a good way
to avoid typing the full path of the modules every time you run them. The
output is:
C:\Users>python C:\Users\test.py
Hello World!
3**2 equal to 9


When we run the module test.py, the results are displayed in the operating
system’s prompt and are lost when the prompt is closed. To store the
results in a file, we can use the shell redirection syntax by typing:


python test.py > save.txt


The output of test.py is redirected and saved in the save.txt file.

In the next sections, we are going to learn Python syntax. For now, we are
going to use the command line to explore Python syntax. Later in the current
chapter, we will learn how to set up and use some powerful platforms for
Python programming.


Python syntax 


Before we learn some Python syntax, we are going to explore the main types
of data that can be used in Python and how a program is structured. A
program is a set of modules, which are series of statements that contain
expressions. These expressions create and process objects, which are
variables that represent data.


Python Variables 

In Python, we can use built-in objects, namely numbers, strings, lists,
dictionaries, tuples, and files. Python supports the usual numeric types,
integer and float, as well as complex numbers. Strings are chains of
characters, whereas lists and dictionaries are collections of other objects,
which can be numbers, strings, or other lists or dictionaries. Lists and
dictionaries are indexed and can be iterated through. The main difference
between lists and dictionaries is the way items are stored and how they can
be fetched. Items in a list are ordered and can be fetched by position,
whereas in dictionaries they are stored and fetched by key. Tuples, like
lists, are positionally ordered sets of objects. Finally, Python also allows
creating and reading files as objects. Python provides all the tools and
mathematical functions to process these objects. In this book, we will focus
on the number variables and how to process them, as we won’t need the other
variables for basic machine learning applications.


Python does not require variable declaration, or size or type declaration.
Variables are created once they are assigned a value. For example:

>>> x=5
>>> print(x)
5
>>> x='Hello World !'
>>> print(x)
Hello World !


In the example above, x was assigned a number, then it was assigned a
string. In fact, Python allows changing the type of a variable after it is
created. We can verify the type of any Python object using the type()
function.

>>> x, y, z=10,'Banana',2.4
>>> print(type(x))
<class 'int'>
>>> print(type(y))
<class 'str'>
>>> print(type(z))
<class 'float'>
To declare a string variable, both single and double quotes can be used.
To name a Python variable, only alphanumeric characters and underscores can
be used (e.g., A_9). Note that variable names are case-sensitive and must
not start with a number. For instance, price, Price, and PRICE are three
different variables. Multiple variables can be assigned in one line, as seen
in the example above.


Number Variables 
Python allows three numeric types: int (for integer), float, and complex.
Integers are positive or negative numbers without decimals, of unlimited
length. Floats are negative or positive numbers with decimals. Complex
numbers are expressed with a ‘j’ for the imaginary part as follows:

>>> x=2+5j
>>> print(type(x))
<class 'complex'>


We can convert from one number type to another with the int(), float() and
complex() functions. Note that you cannot convert a complex number to an
int or a float.
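The short example below illustrates these conversions with arbitrary
sample values. The attempt to convert a complex number is wrapped in a try
block because Python raises a TypeError in that case:

```python
x = int(2.7)       # float to int: the decimal part is dropped
y = float(3)       # int to float
z = complex(2)     # int to complex: the imaginary part is 0
print(x, y, z)

try:
    int(2 + 5j)    # a complex number cannot be converted to an int
except TypeError as error:
    print("TypeError:", error)
```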
Python has built-in mathematical operators for the basic operations such as
addition, multiplication, and subtraction. It also has a power operator
(**). Now, if we want to process a set of values, we would want to store
them in one single object such as a list. To define a list, we type the set
of values, separated by commas, between square brackets:

>>> A=[10,20,30,40,50]


We can select one element by typing the element index between square
brackets:

>>> print(A[1])
20


We can also use slice notation to select several elements. For example, to
display the 2nd to the 4th element:
>>> print(A[1:4])
[20, 30, 40]


Note that indexing in Python starts at 0; that is, the index of the first
element is 0. When using slice notation, the element at the second index is
not included, as in the example above. The value of A[4] is 50 and is not
included in the output. To check the length of a list, the len() function
can be used.
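The indexing rules above can be tried out with the same list A:

```python
A = [10, 20, 30, 40, 50]

print(A[0])      # the first element has index 0
print(A[1:4])    # elements at indices 1, 2 and 3
print(A[4])      # the element at index 4 is not part of the slice above
print(len(A))    # number of elements in the list
```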

The disadvantage of using lists to store a set of values is that Python does 
not apply mathematical operations element by element on lists. Let’s say we 
want to add a constant to every element of the list A we created. We have to 
iterate over all the list elements and add the constant to each one. However, 
the NumPy library allows us to create an array and apply the basic 
mathematical operations to it directly. NumPy arrays differ from Python 
lists in that a NumPy array only stores values of the same type. The NumPy 
library is useful in machine learning to create input and output variables 
and to perform the necessary calculations. 
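
To make the difference concrete, the sketch below adds the same constant to a list and to a NumPy array. It assumes NumPy is installed; the names A_plus_5, B and B_plus_5 are hypothetical and used only for this illustration:

```python
import numpy as np

A = [10, 20, 30, 40, 50]

# With a plain list, the constant is added element by element.
A_plus_5 = [value + 5 for value in A]

# With a NumPy array, the operator applies to every element at once.
B = np.array(A)
B_plus_5 = B + 5

print(A_plus_5)
print(B_plus_5)
```

Both approaches give the same values; the array version is shorter and avoids writing the loop by hand.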


In order to be able to exploit the built-in function of the Numpy library, we 
must import the library into the workspace by typing: 


>>> import numpy as np 


Use the command pip install numpy to install this library if it is not 
already installed on the system. 
To create an array, we type: 

>>> A=np.array([10,20,30,40]) 


Now, we can add, multiply or subtract a constant value from an array by 
using the simple mathematical operators: 


>>> X=np.array([1,2,3,4]) # Creating a vector 
>>> print(X) 

[1 2 3 4] 

>>> X=X+5 # Adding 5 to all elements 

>>> print(X) 

[6 7 8 9] 

>>> X=X*10 # Multiplying all elements by 10 
>>> print (X) 

[60 70 80 90] 

>>> X=X-10 # Subtracting 10 from all elements 
>>> print (X) 

[50 60 70 80] 

>>> X=X**2 # Square of all elements 

>>> print (X) 

[2500 3600 4900 6400] 


Use the functions max() and min() to get the array’s maximum and minimum 
values. The function sum() provides the sum of the elements of the array. We 
can also apply the operators (+, *, -, /) to two arrays. These operators are 
applied element by element, as in the example below: 

>>> Y=np.array([1,2,3,4]) # Create a second array 

>>> print (Y) 

[1 2 3 4] 

>>> X+Y 

array([2501, 3602, 4903, 6404]) 

>>> X*Y 

array([2500, 7200, 14700, 25600]) 


>>> X/Y 

array([2500. , 1800. , 1633.333333, 1600. ]) 
>>> X-Y 

array([2499, 3598, 4897, 6396]) 


Usually, in machine learning, we have multiple input features. For example, 
suppose we want to develop a machine learning program that predicts human age 
from height and weight, and the dataset consists of 100 records of height and 
weight. It is usually more convenient to store these values as a matrix 
where each row is a record of height and weight. That is, the dimension of 
the matrix is 100 by 2. To do so in Python, we can create a 
multi-dimensional array as follows: 

>>> A=np.array([[10,20],[30,40]]) 


In this example, A is a 2x2 matrix where the first row is [10,20] and the 
second row is [30,40]. To select an element, we use the same indexing as 
for a one-dimensional array: 

>>> A[0] # first row of the matrix 

array([10, 20]) 

>>> A[0][1] # first row, second column of the matrix 

20 


To check the matrix’s dimensions, you can use the shape attribute as 
follows: 

>>> A.shape 

(2, 2) 

>>> A.shape[0] # number of rows 

2 

>>> A.shape[1] # number of columns 

2 
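
As a small illustration of the height and weight example, the sketch below stores three made-up records in a matrix. The values and the names records, heights and second_record are hypothetical, and the [:, 0] notation (all rows, first column) is standard NumPy indexing that has not been introduced in the text so far:

```python
import numpy as np

# Hypothetical height (cm) and weight (kg) records, one row per person.
records = np.array([[170, 65],
                    [182, 80],
                    [158, 52]])

n_records, n_features = records.shape   # number of rows and columns
heights = records[:, 0]                 # first column: all heights
second_record = records[1]              # second row: one person

print(n_records, n_features)
print(heights)
print(second_record)
```

With 100 records instead of three, the same code would report a shape of 100 by 2.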


We will cover the Numpy library in more detail later in this chapter. 


In the next section, we are going to learn how to iterate through an array or 
matrix and how to use if-else statements. 


If test and loops in Python 
Before learning how to code loops and if statements, it is important to 
mention that indentation is crucial when programming in Python. 
Indentation in Python is used to indicate a block of code. If you skip 
indentation, Python raises an error: 

>>> if 10>5: 
... print('YES') 
  File "<stdin>", line 2 
    print('YES') 
    ^ 
IndentationError: expected an indented block 


Therefore, whenever you use a loop or an if statement, you must indent the 
instructions that belong to it. 
If statements allow running a set of instructions if a condition is met. An 
if statement can contain other if statements. 


The general format is as follows: 

if <condition1>:     # if test 
    <statements1>    # block of instructions to run when condition1 is true 
elif <condition2>:   # optional additional condition 
    <statements2>    # instructions to run when condition2 is true 
else: 
    <statements3>    # instructions to run otherwise 


For example, let’s check whether all values stored in X are greater than 10 
and display true if so, or false otherwise. 

>>> if all(X>10): 
...     print('true') 
... else: 
...     print('false') 
... 
true 
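
The general format, including the optional elif branch and a nested if, can be sketched as follows. The variable age, its value and the category labels are hypothetical and serve only to illustrate the syntax:

```python
age = 30  # hypothetical value

if age < 18:
    category = 'minor'
elif age < 65:
    category = 'adult'
    if age >= 60:              # nested if inside the elif block
        category = 'adult, close to retirement'
else:
    category = 'senior'

print(category)
```

Only the first branch whose condition is true is executed; the remaining branches are skipped.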


Python provides two loop commands, namely the while loop and the for loop. 
The for loop is used when the number of iterations is known beforehand. 
In contrast, the while loop is used when the number of iterations is 
unknown; it executes a block of statements as long as a condition is 
satisfied. Both can produce the same results, but as a rule of thumb, always 
use a for loop when you know exactly the number of iterations. The reason 
behind this rule is to avoid infinite loops. 
The general format of a for loop is: 

for <element> in <object>:   # assign object items to element 
    <statements>             # block of statements 


Python allows exiting a for loop early with a break statement when a 
condition is satisfied, and it allows an optional else block that runs once 
the loop is over, provided the loop did not exit through break. A continue 
statement is also available; it skips the rest of the loop body and goes back 
to the top of the loop when a condition is satisfied. The format is as 
follows: 

for <element> in <object>:        # assign object items to element 
    <statements1>                 # block of statements 
    if <condition1>: break        # exit loop and skip else 
    if <condition2>: continue     # go to the top and skip the rest of the block 
else: 
    <statements2>                 # runs if the loop did not hit break 
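
A minimal sketch of this format is given below. The list values and the thresholds are made up for the illustration; the loop skips negative values with continue and stops at the first value above 10 with break:

```python
values = [3, 8, -1, 12, 7]   # hypothetical data
kept = []

for v in values:
    if v > 10:
        break                # stop the loop at the first value above 10
    if v < 0:
        continue             # skip negative values
    kept.append(v)
else:
    # runs only when the loop finishes without hitting break
    kept.append('done')

print(kept)
```

Because the loop exits through break when it reaches 12, the else block is not executed here.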


Below is a for loop example, displaying each element in the array X that we 
created before: 

>>> for i in X: 
...     print(i) 
... 

2500 

3600 

4900 

6400 


Because the loop we coded is simple we can type it in one single line as 
follows: 

>>> for i in X: print(i) 

2500 

3600 


4900 
6400 


The while loop syntax is similar to the for loop, except that the header is 
a test or condition; the loop runs until that condition is no longer 
satisfied. The general format of a while loop is: 

while <condition>:    # condition to run the loop 
    <statements>      # body of the loop 
else: 
    <statements>      # block to run when the loop exits without break 


Like the for loop, we can add if tests to exit or continue the loop with break 
and continue statements as follows: 

while <condition1>:               # loop test 
    <statements1>                 # block of statements 
    if <condition2>: break        # exit loop and skip else 
    if <condition3>: continue     # continue the loop from the top 
else: 
    <statements2>                 # runs if the loop did not hit break 


For example, let’s display the value of j as long as it is less than 5, and 
exit the loop if j is equal to 4: 

>>> j=1 
>>> while j<5: 
...     print(j) 
...     if j==4: 
...         break 
...     j+=1 
... 
1 
2 
3 
4 


The statement j+=1 increments the value of j by 1 at each iteration. It is 
very important to increase the value of j; otherwise, the loop will never 
exit. 
Now that we have learned how to manipulate arrays and use loops, in the 
next section, we will learn how to develop functions that allow running the 
same code multiple times for different inputs. 


Functions in Python 
Before diving into function coding, it is worthwhile to know how to 
document code in order to make a program easier to understand and more 
readable. Usually, this is done by adding comments to the script. Python 
recognizes a comment by the symbol ‘#’ at the beginning of a line or after a 
statement. When executing the program, Python ignores everything that 
follows the # symbol on that line. For example: 

>>> print('Hello World!') # This is a test comment 

Hello World! 


To add multiple comments, # should be added at each line: 

>>> # This is comment number 1 
>>> # This is comment number 2 


We can also write a comment that spans several lines by using """ at the 
first line and at the last line: 

>>> """ 
... This is a comment 
... Example expanding 
... on three lines 
... """ 


Functions are very useful when you have repetitive tasks to run in a 
program. Instead of typing the same code several times with different object 
values, we can call a function several times with different object values as 
inputs. Python already has some useful built-in functions like the ones we 
have used so far in the examples: print(), max(), etc. Now we are going to 
learn how to define our own functions. The general format of a function is: 

def function_name(inputs): 
    <statements> 


To call a function, we simply use the statement function_name(inputs). 
For example, let’s develop a simple function which displays ‘Hello World!’. 

>>> def first_function(): 
...     print('Hello World!') 
... 
>>> first_function() 
Hello World! 


A function, as shown in the general format, can take inputs and return 
outputs with the return statement. For instance, we can create a function 
that computes, prints, and returns the sum of two variables. 

>>> def sum_function(a,b): 
...     c=a+b 
...     print('The sum of', a, 'and', b, 'is:', c) 
...     return c 
... 
>>> sum_function(3,5) 
The sum of 3 and 5 is: 8 
8 


Note that a function can take as many inputs as needed, separated by 
commas, and the returned value can be stored in an object as follows: 

>>> c=sum_function(3,5) 

The sum of 3 and 5 is: 8 

>>> print(c) 

8 


We can also create a function with default input values: 

>>> def sum_function(a=2, b=1): 
...     return a+b 
... 


When this function is called without any inputs, it runs with the default 
values: 

>>> c=sum_function() 
>>> print(c) 

3 
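
Default values can also be overridden one at a time. The sketch below reuses sum_function() with its defaults; the result names default_sum, partial_sum and keyword_sum are hypothetical:

```python
def sum_function(a=2, b=1):
    return a + b

default_sum = sum_function()        # uses a=2 and b=1
partial_sum = sum_function(10)      # a=10, b keeps its default value
keyword_sum = sum_function(b=5)     # a keeps its default value, b=5

print(default_sum, partial_sum, keyword_sum)
```

Passing an input by name, as in sum_function(b=5), makes it possible to change the second input without repeating the first.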
Functions can be saved in a separate file with a .py extension as a module. 
A module generally contains several functions and can be imported into the 
workspace using the import statement. 


We can save, using a text editor, the function sum_function() we defined 
before in a file called mymodule.py. To use the function, we import 
mymodule then call the function as follows: 


>>> import mymodule 


>>> mymodule.sum_function() 
3 


Modules can also contain variables of any type. In the file mymodule.py, 
we save, for example: 

My_list=[1,2,3,4] 


After importing mymodule, we can access the list as follows: 

>>> a=mymodule.My_list 
>>> print(a) 

[1, 2, 3, 4] 

We can also create an alias for a module when it is imported. For 
example: 

>>> import mymodule as md 
>>> a=md.My_list 
>>> print(a) 

[1, 2, 3, 4] 
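
The whole module workflow can be reproduced in one script, as sketched below. This is only an illustration under a few assumptions: the module file is written from Python into a temporary folder instead of being typed in a text editor, and it is named mymodule_demo (a hypothetical name) so that it does not clash with an existing mymodule.py:

```python
import importlib
import os
import sys
import tempfile

# Write a small module file (normally you would create it in a text editor).
module_dir = tempfile.mkdtemp()
module_path = os.path.join(module_dir, 'mymodule_demo.py')
with open(module_path, 'w') as f:
    f.write('My_list = [1, 2, 3, 4]\n')
    f.write('def sum_function(a=2, b=1):\n')
    f.write('    return a + b\n')

# Make the folder visible to the import system, then import with an alias.
sys.path.insert(0, module_dir)
importlib.invalidate_caches()
import mymodule_demo as md

print(md.sum_function())
print(md.My_list)
```

When the module file is saved in the folder from which Python is started, the two lines that modify sys.path are usually not needed.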


Now that we have learned the basic syntax of Python and how to develop 
modules, let’s explore some useful platforms that make interactive Python 
programming easy. 


Useful Platforms for Python Programming 


Now, let’s learn how to use some tools that allow interactive programming 
and also save the working environment. One powerful platform available is 
Anaconda. This platform is free and can be downloaded from the website 
anaconda.com/distribution. It comes with multiple open source packages and 
with pip, which installs other packages using the pip install command. To 
run Python with Anaconda, you can open the Anaconda prompt and type python, 
as we did with the operating system’s prompt, or use Jupyter Notebook, which 
we will use throughout the rest of the book. 


Jupyter Notebook is an open source web application for interactive 
programming. It supports different programming languages, including Python. 
It can be used for different applications like data cleaning, statistical 
modeling, and machine learning, among others. It is available on the website 
www.jupyter.org. However, as mentioned before, Jupyter Notebook is included 
in the Anaconda platform and can be launched from the Anaconda Navigator. In 
the Windows search bar, type Anaconda, open Anaconda Navigator, and launch 
Jupyter Notebook. Then, you can select Python 3 in the New tab under the 
Notebook option. This should open a new notebook in a new tab of your web 
browser. 
Jupyter Notebook allows running Python code interactively, as we did with 
the operating system’s prompt, and also saving a notebook to use it later or 
exporting it as a Python file or in other formats. In the File menu, there is 
a Download as option which provides different formats. It allows exporting 
the notebook in HTML, LaTeX, PDF, RevealJS, Markdown, reStructuredText or 
executable script formats. 


To run the code in a cell, you can click the ‘Run’ button or press 
Shift-Enter. By default, when you open a new notebook, there is only one 
cell. When you run code in this cell, a new cell is created automatically, 
and so on. All variables created in the cells are shared in the environment. 
This is useful when importing libraries: once libraries are imported or 
variables are defined in one cell, you do not have to import or define them 
again in the following cells. 


The Jupyter notebook is a very handy tool to share live code with other 
parties. It includes several options that support that. For example, it allows 
deleting the output of a cell, which helps clean up a notebook before sharing 
it with other parties. Now that you have learned the basic syntax of Python 
and how to use Jupyter Notebook, in the next section, we are going to 
present some useful and widely used Python libraries. 


What are the Python libraries that are useful for 
machine learning applications? 


The Numpy library that we already explored in this chapter is a powerful 
library for scientific and numerical computing in Python. This library 
allows processing arrays. To install this library, you can use the following 
statement in Jupyter Notebook: 


pip install numpy 


Another widely used library is pandas, which makes it easy to manage data 
structures. It is an open source Python package that provides 
high-performance data analysis tools. The pandas package is built on top of 
the NumPy library. 


The pandas library allows handling dataframe objects. It provides tools to 
import data into data objects from many file formats. It provides tools to 
detect missing data as well as to reshape datasets. It allows deleting 
elements from and inserting new elements into a dataset. It also allows 
aggregating, transforming, merging, and joining datasets. 

The pandas package supports three data structures, which are series, 
dataframe, and Panel. A series is a one-dimensional homogeneous array, like 
arrays in the Numpy library. A dataframe is two-dimensional data with a 
tabular structure whose columns are series that may have different types. It 
is indexed by rows and columns. This is the most widely used structure and 
the one we will use in the machine learning applications in the next chapter 
of this book, along with the series. The Panel is generally a 
three-dimensional array; note that it has been removed from recent versions 
of pandas. 


To use the pandas library, we need to import it first: 


import pandas as pd 
Note that we imported the pandas library and gave the module the alias pd. 
To use any function from this library, we need to access it as 
pd.function_name. Let’s create an empty series. 

Test_series=pd.Series() 

print(Test_series) 

The output is : 

Series([], dtype: float64) 


The result is an empty series object with a float type (recent pandas 
versions report dtype: object for an empty series). We can create a series 
object from an array as follows: 

import numpy as np 
data=np.array([0,1,2,3,4]) 
test_series1=pd.Series(data) 
print(test_series1) 

The result this time is as follows: 

0    0 
1    1 
2    2 
3    3 
4    4 
dtype: int32 

Note that the first column is the index of each element of the array, the 
second column is the data, and the data type is integer (int32). 


The dataframe structure is in tabular form. As we did with series, we can 
create an empty dataframe as follows: 

Test_dt=pd.DataFrame() 
print(Test_dt) 

The returned dataframe object is as presented below: 

Empty DataFrame 
Columns: [] 
Index: [] 

We usually create a dataframe object from existing data, such as a 
dictionary, as follows: 

data={'Age': [20,30,40,50,60,70,80], 
      'Name': ['Bob','Brian','John','Chris','James','Steve','Peter'] 
} 
Indiv_dt=pd.DataFrame(data) 
print(Indiv_dt) 


The dataframe object is as presented below: 

   Age   Name 
0   20    Bob 
1   30  Brian 
2   40   John 
3   50  Chris 
4   60  James 
5   70  Steve 
6   80  Peter 


We can retrieve a column from the dataframe by using the name of the 
column. Let’s print, for example, the Age column: 

print(Indiv_dt['Age']) 

This would return: 

0    20 
1    30 
2    40 
3    50 
4    60 
5    70 
6    80 
Name: Age, dtype: int64 


We can delete columns or rows with the drop() and pop() functions or the del 
statement. We can also view the first 5 rows by calling the head() function: 

print(Indiv_dt.head()) 

This returns the first rows of the dataframe: 

   Age   Name 
0   20    Bob 
1   30  Brian 
2   40   John 
3   50  Chris 
4   60  James 


We can view the last elements with the tail() function: 

print(Indiv_dt.tail()) 

It returns the last 5 rows of the dataframe: 

   Age   Name 
2   40   John 
3   50  Chris 
4   60  James 
5   70  Steve 
6   80  Peter 

These functions are very handy to explore large datasets. We can check the 
size with the size attribute: 

Indiv_dt.size 

It returns 14, which is the number of elements of the dataframe object. In 
this example, it has 7 values of the variable Age and 7 values of the 
variable Name. For more precise information about the shape of the dataset, 
we can use the shape attribute. 

Indiv_dt.shape 

This returns (7,2), which means we have 7 rows and 2 columns. 


The pandas library has the basic statistics functions listed below: 
• Sum of values as sum() 
• Cumulative sum as cumsum() 
• Cumulative product as cumprod() 
• Standard deviation as std() 
• Average as mean() 
• Median as median() 
• Minimum value as min() 
• Maximum value as max() 
• Absolute value as abs() 
• Mode of the dataset as mode() 
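
A few of these functions can be tried on the ages used earlier in this chapter. The sketch below assumes pandas is installed; the names ages, total, average, middle and running_total are hypothetical:

```python
import pandas as pd

ages = pd.Series([20, 30, 40, 50, 60, 70, 80])

total = ages.sum()                       # sum of all values
average = ages.mean()                    # arithmetic mean
middle = ages.median()                   # median value
running_total = ages.cumsum().tolist()   # cumulative sum as a list

print(total, average, middle)
print(running_total)
```

The same functions can be called on a dataframe column, for example Indiv_dt['Age'].mean().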


The pandas library provides a function that summarizes the dataframe, which 
is the describe() function. For example, let’s apply this function to the 
dataframe we created: 

print(Indiv_dt.describe()) 

This function returns: 

        Age 
count   7.0 
mean   50.0 
std    21.6 
min    20.0 
25%    35.0 
50%    50.0 
75%    65.0 
max    80.0 

The describe() function returns the basic statistics of the dataframe. Note 
that we have two variables, or two columns, in the dataframe. The describe() 
function provided the summary statistics only for Age because it is a 
numerical variable. The Name column was ignored. 


Functions can be applied to the rows or columns of a dataframe using the 
apply() function. By default, apply() passes each column to the function; 
with the option axis=1, it passes each row instead. 

We can conform a dataframe to a new set of row or column labels with the 
reindex() function, which returns a new dataframe. We can also rename the 
labels of the columns with the rename() function. 
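
The sketch below illustrates apply() and rename() on a shortened version of the dataframe created earlier. The new column Age_in_months, the label First_name and the variable names labels and renamed are hypothetical choices made for this illustration:

```python
import pandas as pd

Indiv_dt = pd.DataFrame({'Age': [20, 30, 40],
                         'Name': ['Bob', 'Brian', 'John']})

# Apply a function to every value of one column.
Indiv_dt['Age_in_months'] = Indiv_dt['Age'].apply(lambda age: age * 12)

# Apply a function row by row (axis=1) to combine several columns.
labels = Indiv_dt.apply(
    lambda row: row['Name'] + ' (' + str(row['Age']) + ')', axis=1)

# Rename a column label; rename() returns a new dataframe.
renamed = Indiv_dt.rename(columns={'Name': 'First_name'})

print(labels.tolist())
print(list(renamed.columns))
```

Note that rename() leaves the original dataframe unchanged unless its result is assigned back to the same name.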


The pandas library is very useful for detecting missing values. In data 
analysis and machine learning, detecting missing values is a crucial step 
because they affect the analysis and the accuracy of the final results. 
Let’s create a dataframe with missing values and show how these data can be 
handled with pandas: 

dt=pd.DataFrame(np.random.randn(5,3),index=[1,2,3,4,5],columns=[1,2,3]) 
dt=dt.reindex([1,2,3,4,5,6,7,8]) 

The first statement creates a dataframe with 5 rows and 3 columns from 
random values. By reindexing the dataframe, we created missing values as 
presented in the output table below: 


       1      2      3 
1  -2.29   0.18   0.89 
2  -0.72  -0.53  -1.72 
3   0.75   0.95   0.46 
4  -0.33   1.41   1.17 
5   0.9    1.2    0.8 
6    NaN    NaN    NaN 
7    NaN    NaN    NaN 
8    NaN    NaN    NaN 


The pandas library has two major functions, isnull() and notnull(), to check 
for the presence of missing values. If we use the isnull() function, it 
returns False or True for each element in the dataframe as follows: 

dt.isnull() 

       1      2      3 
1  False  False  False 
2  False  False  False 
3  False  False  False 
4  False  False  False 
5  False  False  False 
6   True   True   True 
7   True   True   True 
8   True   True   True 


This function returns True for the missing data and False otherwise. The 
notnull() function returns the opposite output, which means it returns 
False for the missing data and True otherwise. Applying the notnull() 
function to the dataframe we created returns: 

       1      2      3 
1   True   True   True 
2   True   True   True 
3   True   True   True 
4   True   True   True 
5   True   True   True 
6  False  False  False 
7  False  False  False 
8  False  False  False 


Missing values can distort computations. For example, NumPy functions such 
as np.sum() return NaN when the data contain missing values, and although 
pandas functions such as sum() skip missing values by default, many machine 
learning functions do not accept them. Therefore, we need either to remove 
or to estimate the missing values. The pandas library offers two methods to 
fill the missing values, namely the forward fill (pad) and backfill methods. 
To apply these methods, we use the fillna() function as follows: 

print(dt.fillna(method='pad')) 

This function returns: 

          1         2         3 
1 -2.289485  0.178246  0.894550 
2 -0.724545 -0.531302 -1.720398 
3  0.745967  0.954951  0.463451 
4 -0.330040  1.407060  1.169614 
5 -0.330040  1.407060  1.169614 
6 -0.330040  1.407060  1.169614 
7 -0.330040  1.407060  1.169614 

The pad method propagates the last valid value forward to fill the missing 
values that follow it in the dataframe. If we apply the backfill method, 
print(dt.fillna(method='backfill')), it would return: 

          1         2         3 
1 -2.289485  0.178246  0.894550 
2 -0.724545 -0.531302 -1.720398 
3  0.745967  0.954951  0.463451 
4 -0.330040  1.407060  1.169614 
5       NaN       NaN       NaN 
6       NaN       NaN       NaN 
7       NaN       NaN       NaN 


The backfill method does not fill the missing values in this case because 
it fills each missing value with the next valid value found further down the 
dataframe. Here, all missing values are at the end of the dataframe, so no 
valid value follows them and the last elements remain NaN. 

We can delete the rows with missing values using the dropna() function as 
follows: 

dt.dropna() 

This function would return the following dataframe: 

          1         2         3 
1 -2.289485  0.178246  0.894550 
2 -0.724545 -0.531302 -1.720398 
3  0.745967  0.954951  0.463451 
4 -0.330040  1.407060  1.169614 
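
The steps above can be sketched on a small dataframe with fixed values, so that the results are reproducible. The column names 'a' and 'b' and the values are hypothetical. The sketch uses the ffill() and bfill() methods, which perform the same forward and backward filling as fillna() with the pad and backfill options and which, as far as we know, are the preferred spelling in recent pandas versions:

```python
import numpy as np
import pandas as pd

dt = pd.DataFrame({'a': [1.0, np.nan, 3.0, np.nan],
                   'b': [10.0, 20.0, np.nan, np.nan]})

missing_per_column = dt.isnull().sum()   # number of NaN in each column
forward = dt.ffill()                     # propagate the last valid value forward
backward = dt.bfill()                    # use the next valid value, if any
complete_rows = dt.dropna()              # keep only rows without NaN

print(missing_per_column)
print(forward)
print(backward)
print(complete_rows)
```

As in the example above, the missing values at the end of a column cannot be filled backward because no valid value follows them.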


The pandas library allows merging and joining several dataframe objects 
using the merge() function. Let’s create two dataframe objects. 

import pandas as pd 
male_list={'Name': ['Brian','Peter','Steve','Mark','Josh'], 
           'Age': [10,20,30,40,50] 
} 
female_list={'Name': ['Alice','Helen','Fallon','Silvia','Jane'], 
             'Age': [10,20,30,40,50] 
} 
dt1=pd.DataFrame(male_list) 
dt2=pd.DataFrame(female_list) 

The two dataframe objects we created are as below: 

    Name  Age 
0  Brian   10 
1  Peter   20 
2  Steve   30 
3   Mark   40 
4   Josh   50 


     Name  Age 
0   Alice   10 
1   Helen   20 
2  Fallon   30 
3  Silvia   40 
4    Jane   50 


Now let’s merge the two dataframe objects according to a specific key. 
The key must be found in both dataframe objects. For example, we can merge 
them by age: 

dt=pd.merge(dt1,dt2,on='Age') 

This function merges according to the key passed as an argument to the 
‘on’ parameter. It creates a new dataframe object as follows: 

  Name_x  Age  Name_y 
0  Brian   10   Alice 
1  Peter   20   Helen 
2  Steve   30  Fallon 
3   Mark   40  Silvia 
4   Josh   50    Jane 


We can notice that the resulting dataframe object is a combination of the 
two dataframe objects, matched on their shared Age column. We can also 
merge dataframe objects according to several keys by passing a list of keys 
to the ‘on’ argument of the merge() function. 


The pandas library allows specifying how to merge the dataframe objects 
using the ‘how’ option. The ‘how’ option of the merge() function supports 
four main values: left, right, inner, and outer. If right is chosen, the 
merge uses the keys from the second dataframe object. If left is specified, 
the merge uses the keys from the first dataframe object. If outer is 
specified, the merge uses the union of the keys. In contrast, if inner is 
chosen, the merge uses the intersection of the keys. 
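
The effect of the ‘how’ option is easier to see when the two dataframe objects do not share all their keys. The sketch below uses two small hypothetical dataframes, named left and right, whose ages only partly overlap:

```python
import pandas as pd

left = pd.DataFrame({'Name': ['Brian', 'Peter', 'Steve'],
                     'Age': [10, 20, 30]})
right = pd.DataFrame({'Name': ['Alice', 'Helen', 'Jane'],
                      'Age': [20, 30, 40]})

inner = pd.merge(left, right, on='Age', how='inner')       # ages in both
outer = pd.merge(left, right, on='Age', how='outer')       # ages in either
left_join = pd.merge(left, right, on='Age', how='left')    # ages of left
right_join = pd.merge(left, right, on='Age', how='right')  # ages of right

print(inner)
print(outer)
```

In the outer, left and right merges, the names that have no match in the other dataframe are filled with NaN.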


To combine two dataframe objects into a single dataframe object, the 
concat() function can be used. The concat() function concatenates not only 
dataframe objects but also series objects. Let’s, for example, concatenate 
the two dataframe objects of females and males into a single dataframe 
object: 

import pandas as pd 
male_list={'Name': ['Brian','Peter','Steve','Mark','Josh'], 
           'Age': [10,20,30,40,50] 
} 
female_list={'Name': ['Alice','Helen','Fallon','Silvia','Jane'], 
             'Age': [10,20,30,40,50] 
} 
dt1=pd.DataFrame(male_list) 
dt2=pd.DataFrame(female_list) 
dt=pd.concat([dt1,dt2]) 
print(dt) 


The function returns: 
Name Age 
0 Brian 10 
1 Peter 20 
2 Steve 30 
3 Mark 40 
4 Josh 50 
0 Alice 10 
1 Helen 20 
2 Fallon 30 
3 Silvia 40 
4 Jane 50 


We can label the rows coming from the first and second dataframe objects in 
the concatenated dataframe object as follows: 
dt=pd.concat([dt1,dt2],keys=['Male','Female']) 
print(dt) 
The output is as presented below: 
Name Age 
Male 0 Brian 10 
1 Peter 20 
2 Steve 30 
3 Mark 40 
4 Josh 50 
Female 0 Alice 10 
1 Helen 20 
2 Fallon 30 
3 Silvia 40 
4 Jane 50 


Note that in the concatenated dataframe objects of both examples above, the 
index values are repeated. We can overcome this issue by setting 
ignore_index to True. This option ignores the indexing of the dataframe 
objects passed to the concat() function. 

dt=pd.concat([dt1,dt2],keys=['Male','Female'], ignore_index=True) 
print(dt) 


The concat() function with ignore_index set to True returns: 

     Name  Age 
0   Brian   10 
1   Peter   20 
2   Steve   30 
3    Mark   40 
4    Josh   50 
5   Alice   10 
6   Helen   20 
7  Fallon   30 
8  Silvia   40 
9    Jane   50 


Note that when setting ignore_index to True, the keys are overridden too. In 
this case, the function uses a new index for the concatenated dataframe 
object. 


Another built-in function that concatenates several dataframe objects is 
the append() function. Note that append() was deprecated in pandas 1.4 and 
removed in pandas 2.0, so the example below requires an older version of 
pandas. Let’s create three dataframe objects and apply the append() function 
to concatenate them. 

dt1=pd.DataFrame({'Name': ['Brian','Peter','Steve'], 'Age': [10,20,30]}) 
dt2=pd.DataFrame({'Name': ['Alice','Helen','Fallon'], 'Age': [10,20,30]}) 
dt3=pd.DataFrame({'Name': ['Mark','Josh','Silvia'], 'Age': [20,30,50]}) 
dt=dt1.append([dt1,dt2,dt3]) 
print(dt) 
The append function returns: 

     Name  Age 
0   Brian   10 
1   Peter   20 
2   Steve   30 
0   Brian   10 
1   Peter   20 
2   Steve   30 
0   Alice   10 
1   Helen   20 
2  Fallon   30 
0    Mark   20 
1    Josh   30 
2  Silvia   50 
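
The same result can be obtained with the concat() function, which also works in pandas versions that no longer provide append(). The sketch below is one possible equivalent; the name dt_reindexed is hypothetical:

```python
import pandas as pd

dt1 = pd.DataFrame({'Name': ['Brian', 'Peter', 'Steve'], 'Age': [10, 20, 30]})
dt2 = pd.DataFrame({'Name': ['Alice', 'Helen', 'Fallon'], 'Age': [10, 20, 30]})
dt3 = pd.DataFrame({'Name': ['Mark', 'Josh', 'Silvia'], 'Age': [20, 30, 50]})

# Same rows as dt1.append([dt1, dt2, dt3]): dt1 appears twice.
dt = pd.concat([dt1, dt1, dt2, dt3])

# ignore_index=True replaces the repeated labels with a new 0..11 index.
dt_reindexed = pd.concat([dt1, dt1, dt2, dt3], ignore_index=True)

print(dt)
print(dt_reindexed)
```

The first dataframe is listed twice on purpose, to reproduce the output shown above.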


The pandas library has two major functions to import a dataset into the 
workspace, namely read_csv() and read_table(). The first function, 
read_csv(), imports data from a CSV (comma-separated values) file. The basic 
syntax for this function is: 

pandas.read_csv(path_File) 

We can also customize the import by specifying options such as the 
separating character or delimiter used in the file, whether to read a header 
row, the names of the columns, the column to use as the index, the columns 
to read, and how many rows to skip. A more detailed call of the read_csv() 
function is presented in the following statement: 

pandas.read_csv(path_File, sep=';', delimiter=None, header='infer', 
                names=None, index_col=None, usecols=None, skiprows=3) 


We can also set the type of a variable in the imported table by using 
dtype as follows: 

pandas.read_csv(path_File, dtype={'variable_name': np.float64}) 


The function read_table() uses the same syntax as read_csv() to import 
tabular data as dataframe objects into the Python workspace. 
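
The options above can be tested without creating a file on disk by reading the text from memory with io.StringIO. In the sketch below, the file content, the two descriptive lines at the top and the name raw_text are all made up for the illustration:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical file content: two descriptive lines, then a ';'-separated table.
raw_text = """exported by a survey tool
units: years
Name;Age
Bob;20
Brian;30
"""

# skiprows=2 ignores the first two lines; the next line is used as the header.
data = pd.read_csv(io.StringIO(raw_text), sep=';', skiprows=2,
                   dtype={'Age': np.float64})

print(data)
print(data.dtypes)
```

With a real file, io.StringIO(raw_text) would simply be replaced by the path of the file.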


The pandas library allows handling dataframes as tables and performing 
SQL-like operations such as selecting, selecting with filters, and grouping. 
Let’s create, for example, the following dataset and save it into a CSV file 
named test.csv: 


Name    Gender  Age  Smoker 
Silvie  female  14   yes 
Jean    male    19   yes 
Joe     male    25   no 
Brian   male    80   no 
Helen   male    60   yes 
Maxime  female  18   no 


First, we should import the data as a dataframe object by calling the 
function read_csv(): 

import pandas as pd 
data=pd.read_csv('test.csv') 

The dataframe object should be similar to the output below: 

     Name  Gender  Age Smoker 
0  Silvie  female   14    yes 
1    Jean    male   19    yes 
2     Joe    male   25     no 
3   Brian    male   80     no 
4   Helen    male   60    yes 
5  Maxime  female   18     no 


Now, let’s explore how we can perform some of the SQL operations. To 
select a variable (column) from the table, we simply specify the name of the 
variable between brackets. For example, to get the age of all individuals in 
the table, we type the following statement: 

Age=data['Age'] 
print(Age) 

The output is: 

0    14 
1    19 
2    25 
3    80 
4    60 
5    18 
Name: Age, dtype: int64 


We can also select several variables in a single statement by passing the 
list of the variables between brackets. Let’s say we want the age and 
whether the individual is a smoker or not. We can run the following 
statement: 

Sub_data=data[['Age','Smoker']] 
print(Sub_data) 

The output is a table with the two variables ‘Age’ and ‘Smoker’: 

   Age Smoker 
0   14    yes 
1   19    yes 
2   25     no 
3   80     no 
4   60    yes 
5   18     no 


We can apply a filter when selecting data. For example, let’s select the 
individuals with age over 20 years. We specify between brackets the 
condition on which the data should be selected as follows: 

Sub_data=data[data['Age']>20] 
print(Sub_data) 

This returns all variables for the individuals with age over 20 years 
as follows: 

    Name Gender  Age Smoker 
2    Joe   male   25     no 
3  Brian   male   80     no 
4  Helen   male   60    yes 


We can also group data by specific characteristics using the groupby() 
function. For instance, we can group our data according to gender as 
follows: 

grouped_data=data.groupby('Gender') 

After grouping the individuals by gender, we can get the number of 
individuals in each group using the size() function: 

print(grouped_data.size()) 

The output, in this case, is the number of females and males in the dataset: 

Gender 
female    2 
male      4 
dtype: int64 
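
Filters can be combined and groups can be aggregated, much like the WHERE and GROUP BY clauses of SQL. The sketch below is an illustration under a few assumptions: it reads a reduced version of the table (the Gender column is left out for brevity) from memory instead of from test.csv, and the names csv_text, non_smokers_over_20 and mean_age are hypothetical:

```python
import io

import pandas as pd

csv_text = """Name,Age,Smoker
Silvie,14,yes
Jean,19,yes
Joe,25,no
Brian,80,no
Helen,60,yes
Maxime,18,no
"""
data = pd.read_csv(io.StringIO(csv_text))

# Similar to: SELECT Name FROM data WHERE Age > 20 AND Smoker = 'no'
mask = (data['Age'] > 20) & (data['Smoker'] == 'no')
non_smokers_over_20 = data[mask]['Name']

# Similar to: SELECT Smoker, AVG(Age) FROM data GROUP BY Smoker
mean_age = data.groupby('Smoker')['Age'].mean()

print(non_smokers_over_20.tolist())
print(mean_age)
```

Each condition is written between parentheses and the conditions are combined with the & operator, which plays the role of AND.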


Overall, the pandas library is very handy to manipulate, explore, and 
analyze datasets. Now let’s see the tools that we can use to make figures 
that help visualize and extract valuable information from datasets. 


There are different tools to create figures in Python. One straightforward 
option is the plot() function. Another, more advanced library with more 
options is the matplotlib library. These tools allow drawing scatter plots to 
detect correlation among variables. They allow creating histograms and 
distribution plots of the data, which is very handy for making first 
assumptions about the shape of the data distribution. Overall, these 
libraries offer major features for exploratory data analysis. We will use 
these libraries in the following chapter to develop machine learning 
applications. It is best to explore these libraries with real data. 
Therefore, we will not cover their details in this chapter; you will learn 
how to use them in the next chapter with real data. 


Another library that we will be using is the scikit-learn package. This 
package is an efficient Python tool for data analysis and machine learning. 
The scikit-learn library is built on the NumPy, SciPy and matplotlib 
libraries. It offers tools to perform data preprocessing, clustering, 
regression, model selection, classification, and dimensionality reduction. 
It is a free and open source library. It also provides several datasets for 
testing machine learning algorithms. In the next chapter, Developing a 
Machine Learning Model With Python, we will explore how to use this library 
to build, train, and evaluate machine learning models. All applications that 
we will be developing are based on the datasets available in this toolbox. 
Hence, we will not go through the options and functions of this toolbox in 
this chapter. We will rather go into the details of this library in the next 
chapter with examples using real data. 


Chapter 4: Developing a Machine Learning Model 
With Python 


In this chapter, we are going to apply the Python skills acquired in the 
previous chapter to develop machine learning models. We are going to use 
the Python libraries pandas for data processing, NumPy for numerical 
calculations, seaborn and matplotlib for visualization, and finally 
scikit-learn for machine learning functions. We will be referring to the 
scikit-learn library as sklearn, which is the name used to import the library 
into the Python workspace. 


In example 1, we will go through a simple example of supervised learning
with logistic regression for data classification. In this example, we will be
using the Titanic dataset. We will build a model that predicts whether a
Titanic passenger would have survived or not according to several criteria.
The Titanic dataset is available within the seaborn library.


In example 2, we will go through another method of supervised learning,
which is linear regression. This example is based on the Boston housing
dataset. We will develop a model that predicts house prices using linear
regression. The Boston housing dataset is available in the sklearn library.


Example 3 and example 4 are applications of artificial neural networks.
Both examples use the same dataset, which is the Iris dataset. Example 3 is
an introduction to neural networks through the development of a simple
perceptron. Example 4 develops a multi-layer neural network. You will learn
through this example how to develop and train an artificial neural network
from scratch. You will also learn how to develop the same model using
built-in functions in Python.


All datasets used in the four examples are freely available online and are
widely used as machine learning example applications. In these examples,
you will learn the necessary steps to develop a machine learning model for
different applications.


Through these examples, you will not only learn how to develop machine
learning models but also some new Python skills for cleaning, plotting, and
visualizing data. Basically, you will learn how to explore your data, how to
check for the presence of missing values, and how to convert your data into a
data frame that is easily processed. You will also learn how to manage the
options of the plot functions to make readable figures.


Example 1: Logistic Regression 


This application aims at developing a supervised machine learning model based on
logistic regression. So, what are you going to learn in this example?

Through this example, you will learn the necessary steps to develop a logistic
regression model with Python and how to use the built-in functions in Python libraries
to build a logistic regression model. You will learn how to import data, how to
visualize data and retrieve important information. You will also learn how to clean the
dataset before building a machine learning classifier with logistic regression. We are
also going to cover how we can convert categorical variables into dummy variables
that a regression model can process. Finally, we will go through how
to create a training set from data, and how to fit the model and evaluate its accuracy.
You will learn what a confusion matrix is, how to compute it, and how to
analyze it.


In this example, we are using the Titanic dataset that is widely used to illustrate a
logistic regression model. The Titanic dataset can be freely obtained from
https://www.kaggle.com or can be loaded directly from the seaborn library. We all
know the story of the Titanic and the significant loss of life caused by the
shipwreck, due in part to the limited number of lifeboats. However, there were some
passengers more likely to survive the sinking than others. The passengers who had
luck play in their favor were mainly children, women, and
passengers of the upper class.


From the Titanic dataset, we are going to develop a prediction model that tells us
whether a passenger on the Titanic would have survived or not based on different
criteria.


First, we are going to load Python packages:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


Then, we import the Titanic data coming from the seaborn library:
Titanic_data=sns.load_dataset('titanic')


Before starting the machine learning process, let’s explore the dataset first. This step
of exploratory data analysis is very important to understand how the data is stored and
to detect any missing data or outliers.


With the following function, we are going to explore how the data is stored. It shows 
the first few lines of the data: 
Titanic_data.head() 


The output is shown in the following table. The first, unnamed column is the row
index; the second column, survived, is a binary variable (i.e., 0 or 1) that indicates
whether the passenger survived or not. The following columns provide
information about the passenger: pclass, which is the ticket class (1: upper, 2: middle
and 3: lower), sex, age, sibsp (number of siblings or spouses aboard), parch (number of
parents or children aboard), fare (i.e., passenger fare) and so on. Note that there is an
additional variable, alive, that provides the same information as the survived variable.


   survived  pclass     sex   age  sibsp  parch     fare embarked  class    who  adult_male deck  embark_town alive  alone
0         0       3    male  22.0      1      0   7.2500        S  Third    man        True  NaN  Southampton    no  False
1         1       1  female  38.0      1      0  71.2833        C  First  woman       False    C    Cherbourg   yes  False
2         1       3  female  26.0      0      0   7.9250        S  Third  woman       False  NaN  Southampton   yes   True
3         1       1  female  35.0      1      0  53.1000        S  First  woman       False    C  Southampton   yes  False
4         0       3    male  35.0      0      0   8.0500        S  Third    man        True  NaN  Southampton    no   True


Sample of the Titanic dataset visualized with head() function. 


Now let’s see the size of the data. In other words, let’s find how many passengers are
stored in this dataset. To do so, we use the function count(), which returns the number
of non-missing values in each column:
Titanic_data.count()


The output of this function is:


survived       891
pclass         891
sex            891
age            714
sibsp          891
parch          891
fare           891
embarked       889
class          891
who            891
adult_male     891
deck           203
embark_town    889
alive          891
alone          891
dtype: int64


In total, the dataset holds records for 891 passengers. We can see at this stage
that there are some missing values for the variables age, deck, embarked, and
embark_town. We will deal with this later. Now let’s see how many people
survived the Titanic. We are going to plot the number of people that survived
and did not survive as bars with the countplot() function:

sns.countplot(x='survived', data=Titanic_data, palette=["k", "k"])


The countplot() function implemented in the seaborn library takes as input x, the
variable for which we compute the frequency, and data, the dataset. The palette option
specifies the color of each group, i.e., each bar. In this example, both bars, survived
and not survived, are in black. The output of this function is given in the figure below.
We note that more than 500 people in the dataset did not survive the Titanic.













Bar plot of people that survived and did not survive the Titanic. 
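

The heights of the bars in such a plot can also be obtained as plain numbers. The short sketch below is only an illustration: it uses a small made-up data frame (the values are hypothetical, not taken from the Titanic dataset) and the pandas method value_counts().

```python
import pandas as pd

# Hypothetical toy data: 1 = survived, 0 = did not survive
toy = pd.DataFrame({'survived': [0, 1, 0, 0, 1, 0]})

# Count how many times each value occurs in the column
counts = toy['survived'].value_counts()
print(counts)
```

Applied to the real data, the same call would be Titanic_data['survived'].value_counts().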


Now let’s see the number of survivors according to the sex of the passengers. To do so,
we are going to add the hue option to countplot() in order to specify that we
want the count of survivors for each sex separately. We are also going to change
the colors. Because we need several shades of grey for the bars, we are going to
create a color palette for our bar chart. This palette will then be passed as an
argument to the countplot() function.
seq_palette=sns.color_palette('Greys',4)
sns.countplot(x='survived', hue='sex', data=Titanic_data, palette=seq_palette)


The output should be similar to the figure below: 







Survived passengers from the Titanic according to gender. 


This figure shows us that more women than men survived the Titanic. More than
400 men did not survive, and only about 100 men did survive
the Titanic.
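
The same kind of breakdown can be computed as a table of counts. The following is a minimal sketch on a made-up data frame (the five rows are hypothetical); it assumes the pandas function crosstab(), which counts the rows for every combination of two columns.

```python
import pandas as pd

# Hypothetical toy data, not the real Titanic records
toy = pd.DataFrame({
    'sex':      ['male', 'female', 'male', 'female', 'male'],
    'survived': [0, 1, 0, 1, 1],
})

# Rows: sex, columns: survived (0 or 1), cells: number of passengers
table = pd.crosstab(toy['sex'], toy['survived'])
print(table)
```

Each cell of this table corresponds to the height of one bar in a countplot() drawn with the hue option.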

We can plot the same information according to the social class of passengers:
sns.countplot(x='survived', data=Titanic_data, hue='pclass', palette=seq_palette)




Survived passengers from the Titanic according to social class. 


The figure shows that the majority of people that did not survive the Titanic were
from the lower class.


Now let’s check the missing values in the dataset. To do so, we are going to compute
the number of null values in the dataset:
Titanic_data.isnull().sum()


The function isnull() marks every missing value as True, and sum() counts the
number of True values in each column. The output of the function given above is:


survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
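

To see how isnull().sum() and count() relate to each other, here is a small sketch on a made-up data frame (the values are hypothetical). For every column, the number of missing values plus the number of non-missing values should equal the number of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing ages
toy = pd.DataFrame({
    'age':  [22.0, np.nan, 26.0, np.nan],
    'fare': [7.25, 71.28, 7.92, 53.10],
})

mask = toy.isnull()      # Boolean frame: True where a value is missing
missing = mask.sum()     # True counts as 1, so this counts missing values per column
present = toy.count()    # number of non-missing values per column

print(missing)
print(present)
```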


We can notice that 177 values are missing for the variable age, 688 for
the deck variable, and 2 each for embarked and embark_town. There are several
approaches to estimating missing values. These approaches are not covered in this
book. We will rather focus on how to develop and train a prediction model. Therefore,
the variables age, embark_town and deck will not be used in the model. They are
dropped from the dataset with the drop() function, and the two remaining rows with a
missing embarked value are removed with the dropna() function:


Titanic_data.drop('age',axis=1, inplace=True)
Titanic_data.drop('embark_town',axis=1, inplace=True)
Titanic_data.drop('deck',axis=1, inplace=True)
Titanic_data.dropna(inplace=True)
Titanic_data.head()


The axis option in the drop() function specifies whether we want to drop columns
(axis=1) or rows (axis=0). Now the dataset looks like this:


   survived  pclass     sex  sibsp  parch     fare embarked  class    who  adult_male alive  alone
0         0       3    male      1      0     7.25        S  Third    man        True    no  False
1         1       1  female      1      0  71.2833        C  First  woman       False   yes  False
2         1       3  female      0      0    7.925        S  Third  woman       False   yes   True
3         1       1  female      1      0     53.1        S  First  woman       False   yes  False
4         0       3    male      0      0     8.05        S  Third    man        True    no   True


Titanic dataset after deleting variables with missing values. 
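

The difference between dropping a column and dropping incomplete rows can be seen on a tiny example. The sketch below uses a made-up data frame (hypothetical values) and returns new frames instead of using inplace=True, so that each step can be inspected.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in two columns
toy = pd.DataFrame({
    'age':      [22.0, np.nan, 26.0],
    'embarked': ['S', 'C', None],
    'fare':     [7.25, 71.28, 7.92],
})

# axis=1 removes a whole column, whatever its content
without_age = toy.drop('age', axis=1)

# dropna() removes every row that still contains a missing value
complete = without_age.dropna()

print(without_age.columns.tolist())
print(len(complete))
```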


Now looking at the dataset above, we can see that the ‘who’, ‘sex’ and
‘adult_male’ variables provide largely the same information. The variable ‘alive’ is
the same as the variable ‘survived.’ The variables ‘class’ and ‘pclass’ also give the
same information. Finally, the information in the variable ‘alone’ can be retrieved
from the variables ‘sibsp’ and ‘parch.’ Therefore, the variables ‘who,’ ‘adult_male,’
‘class,’ ‘alive’ and ‘alone’ will be dropped from the dataset.

Titanic_data.drop(['class', 'who', 'adult_male', 'alive', 'alone'], axis=1,inplace=True)


The new Titanic dataset is like this:


   survived  pclass     sex  sibsp  parch     fare embarked
0         0       3    male      1      0     7.25        S
1         1       1  female      1      0  71.2833        C
2         1       3  female      0      0    7.925        S
3         1       1  female      1      0     53.1        S
4         0       3    male      0      0     8.05        S


The final dataset of Titanic. 


Note that the variables ‘sex’ and ‘embarked’ are categorical. In order to make these
variables easily processed by the machine learning algorithm, we need to transform
them into dummy variables using the pandas library. Dummy variables are
variables that take the value 0 or 1 to indicate the absence or presence of a category.
In our example, we create a dummy variable for ‘sex’ that is equal to 1 if the
passenger is male and 0 otherwise. For ‘embarked,’ which has three categories (C, Q
and S), we create two dummy variables: Q is equal to 1 if the passenger embarked at Q
and 0 otherwise, and S is equal to 1 if the passenger embarked at S and 0 otherwise.
The option drop_first=True drops the first category (female and C, respectively),
which is implied when all the remaining dummy variables are 0.

sex=pd.get_dummies(Titanic_data['sex'], drop_first=True)
embarked=pd.get_dummies(Titanic_data['embarked'], drop_first=True)


Then, we drop the original ‘sex’ and ‘embarked’ variables and concatenate the
new dummy variables to our Titanic dataset:
Titanic_data.drop(['sex','embarked'],axis=1,inplace=True)
Titanic_data = pd.concat([Titanic_data, sex, embarked],axis=1)


Our final Titanic dataset should look like the table below:


   survived  pclass  sibsp  parch     fare  male  Q  S
0         0       3      1      0     7.25     1  0  1
1         1       1      1      0  71.2833     0  0  0
2         1       3      0      0    7.925     0  0  1
3         1       1      1      0     53.1     0  0  1
4         0       3      0      0     8.05     1  0  1
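

To make the effect of drop_first=True concrete, here is a minimal sketch on three made-up passengers (hypothetical values). The dtype=int argument is an addition for readability: it asks pandas to return 0/1 integers, since recent pandas versions return True/False by default.

```python
import pandas as pd

# Hypothetical toy data covering every category once
toy = pd.DataFrame({'sex': ['male', 'female', 'female'],
                    'embarked': ['S', 'C', 'Q']})

# Categories are sorted alphabetically; drop_first removes the first one
sex = pd.get_dummies(toy['sex'], drop_first=True, dtype=int)            # keeps 'male'
embarked = pd.get_dummies(toy['embarked'], drop_first=True, dtype=int)  # keeps 'Q' and 'S'

encoded = pd.concat([sex, embarked], axis=1)
print(encoded)
```

The second passenger (female, embarked at C) is encoded with zeros everywhere, which is how the dropped categories are represented.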


Now, our dataset is ready for building a logistic regression model. First, we split our
data into two datasets. The first, which we will call the training dataset, will be used
to train the model. The second, which we will call the test dataset, will be used to
assess the model’s accuracy. To do so, we will employ the sklearn toolbox, from which
we import the function train_test_split() that performs the splitting
of the data.


In this example, the data features X we are using as predictors to predict if a
passenger would have survived or not are: social class (pclass), number of siblings or
spouses on board (sibsp), number of parents or children on board (parch), the fare,
the sex and the port of embarkation. The label Y of this dataset is the variable
survived. When splitting the data with train_test_split(), we need to specify X, which
here is the Titanic dataset without the variable survived, and the labels Y, which here
is the survived variable. We also need to specify the percentage of the data that will be
used as test data. Here, we will use 30% of the dataset to form the test data. The
random_state argument fixes the random shuffling so that the split is reproducible. The
statements are as follows:

X= Titanic_data.drop('survived',axis=1)
Y= Titanic_data['survived']
from sklearn.model_selection import train_test_split
Titanic_train, Titanic_test, Survived_train, Survived_test= train_test_split(X, Y,
    test_size=0.30,random_state=101)


Now we verify the size of our training and test datasets.
print('Size of the training dataset is ', Titanic_train.shape[0])
print('Size of labels of the training dataset is', Survived_train.shape[0])
print('Number of variables or features in the training dataset is ',
      Titanic_train.shape[1])
print('Size of the test data is ', Titanic_test.shape[0])
print('Size of labels of the test dataset is ', Survived_test.shape[0])


The output of this code is:

Size of the training dataset is 622
Size of labels of the training dataset is 622
Number of variables or features in the training dataset is 7
Size of the test data is 267
Size of labels of the test dataset is 267


Now we have formed a training set that contains 70% of the Titanic data. The test data
is formed by the remaining 30% of the data, as we have specified to the
train_test_split() function via the argument test_size.
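

The sizes 622 and 267 can be reproduced without the Titanic data. The sketch below splits 889 placeholder rows (889 is the sum of the two sizes printed above; the array contents are arbitrary) with the same test_size and random_state.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 889 placeholder rows with 2 arbitrary features each
X = np.arange(889 * 2).reshape(889, 2)
y = np.arange(889) % 2  # arbitrary 0/1 labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=101)

# 30% of 889 is 266.7, so the test set is rounded up to 267 rows
print(X_train.shape, X_test.shape)
```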


Now let’s build the logistic model. The sklearn library already has a built-in function 
to develop a logistic regression model which we are going to import: 
from sklearn.linear_model import LogisticRegression 


After this, we fit the logistic regression model to the training data:
lnmodel = LogisticRegression()
lnmodel.fit(Titanic_train, Survived_train)


After fitting the model, we can use it to predict if passengers in the test dataset would
have survived or not:
Survival_Predictions = lnmodel.predict(Titanic_test)
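

To give an idea of what a fitted logistic regression computes when it predicts, here is a minimal sketch with NumPy only. The weights, the intercept and the two passengers are made-up numbers for illustration, not the values learned from the Titanic data. The model computes a weighted sum of the features, turns it into a probability with the sigmoid function, and predicts 1 when this probability is above 0.5.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for two features (not fitted values)
weights = np.array([-1.0, 2.0])
intercept = 0.5

# Two hypothetical passengers described by the same two features
passengers = np.array([[3.0, 0.0],
                       [1.0, 1.0]])

scores = passengers @ weights + intercept         # weighted sums: -2.5 and 1.5
probabilities = sigmoid(scores)                   # probability of class 1
predictions = (probabilities > 0.5).astype(int)   # threshold at 0.5

print(probabilities, predictions)
```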


The sklearn library also has a built-in function to evaluate the accuracy of the
predictions made by the logistic regression model.


from sklearn.metrics import accuracy_score
print('Model accuracy is %2.3f' % accuracy_score(Survived_test,
      Survival_Predictions))


This function returns:


Model accuracy is 0.824


The model is 82% accurate, which means that 82% of the predictions made by the
model on the test dataset are correct. Now let’s look at the confusion matrix. This
matrix is very useful for appraising the performance of a machine learning classifier
such as the one we developed for the Titanic data.


First, let’s define what a confusion matrix is. The confusion matrix is a table
that cross-tabulates the predicted values and the true values. In the example we are
working on, the predicted and true values are 0 or 1, i.e., not survived or
survived. The confusion matrix provides four frequencies: the number of times the
model predicts 1 when the true value is 1, the number of times it predicts 1 when the
true value is 0, the number of times it predicts 0 when the true value is 1, and the
number of times it predicts 0 when the true value is also 0. In other words, a “true
positive” is a positive prediction when the actual value is positive. A “false positive”
is a positive prediction when the actual value is negative. A “false negative” is a
negative prediction when the actual value is positive. Finally, a “true negative” is a
negative prediction when the actual value is negative. Theoretically, the table is as
follows:


Confusion matrix


Predicted/True values    Positive (1)            Negative (0)
Positive (1)             (1|1) true positive     (1|0) false positive
Negative (0)             (0|1) false negative    (0|0) true negative
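

The four cells of this table can be counted by hand. The sketch below does so in plain Python on two short made-up lists of true and predicted labels (hypothetical values).

```python
# Hypothetical true labels and predictions for eight samples
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))

tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # predicted 1, truly 1
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted 1, truly 0
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # predicted 0, truly 1
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # predicted 0, truly 0

print(tp, fp, fn, tn)
```

The four counts always add up to the number of samples.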





Now let’s look at the confusion matrix of our model.
from sklearn.metrics import confusion_matrix
confusion_matrix(Survived_test, Survival_Predictions)


The returned confusion matrix is:
array([[152, 11], [ 36, 68]], dtype=int64)


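In sklearn, the rows of this matrix are the true values and the columns are the
predicted values, both ordered 0 then 1 (the transpose of the theoretical table above).
The model therefore produced 152 true negatives, 11 false positives, 36 false negatives
and 68 true positives. The 47 wrong predictions are few compared to the 220 correct
ones, which is consistent with the accuracy of 82%. Note, however, that the model
misses 36 of the 104 actual survivors, so it is better at identifying passengers who
did not survive than passengers who survived.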
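

The accuracy reported earlier can be recomputed from the confusion matrix itself. The sketch below assumes the scikit-learn convention that rows are true values and columns are predicted values, and it also derives two other common measures, precision and recall.

```python
import numpy as np

# Confusion matrix returned above (rows: true 0/1, columns: predicted 0/1)
cm = np.array([[152, 11],
               [36, 68]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted survivors that truly survived
recall = tp / (tp + fn)           # share of true survivors that the model found

print(round(accuracy, 3), round(precision, 3), round(recall, 3))
```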


Example 2: Linear Regression 


This example presents an application of machine learning to develop a
linear regression model. So, what will you learn through this example?
In this machine learning example, you will learn how to handle a dictionary object in
Python and how to get its keys and features. You will learn how to convert a dictionary
object to a dataframe object in order to use pandas to analyze the data. You will learn
how to plot the histogram and distribution of the data as well as how to customize the
figure. You will learn to compute the summary statistics of the dataset. You will learn
how to develop, fit, and evaluate the accuracy of a linear regression model.
In this example, the Boston housing prices dataset will be used. This data is available
through the sklearn library. The Boston dataset contains information about housing
in Boston, MA, collected by the U.S. Census Service. This dataset was first published
in 1978 by Harrison and Rubinfeld in the Journal of Environmental Economics and
Management. Therefore, the data is not up to date and is used here for illustration
purposes only.
Like in example 1, we start by importing the Python libraries and loading the
dataset:

import numpy as np
import pandas as pd
#Visualization Libraries
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import datasets
# load_boston() is only available in scikit-learn versions before 1.2
boston_raw = datasets.load_boston()


If we check the type of this dataset with the type() function:
print(type(boston_raw))

we see that the dataset is stored as a dictionary-like object. We can get the keys, the
dimensions and the feature names of the dataset with the following commands:


print('Keys of the Boston dataset stored as dictionary are:')
print(boston_raw.keys())
print('The dimension of the Boston dataset is')
print(boston_raw.data.shape)
print('The name of features in the Boston dataset are')
print(boston_raw.feature_names)


Now let’s look at the output of these statements:


Keys of the Boston dataset stored as dictionary are:
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
The dimension of the Boston dataset is
(506, 13)
The name of features in the Boston dataset are
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX'
 'PTRATIO' 'B' 'LSTAT']


We have 506 observations and 13 variables, whose names are given by feature_names.
Before going further in our data analysis, let’s understand what each feature
represents.

• CRIM: crime rate by town per capita
• ZN: the proportion of residential land zoned for lots over 25,000 sq. ft
• INDUS: the proportion of non-retail business acres per town
• CHAS: Charles River dummy variable (1 if tract bounds river, 0 otherwise)
• NOX: nitric oxides concentration (parts per 10 million)
• RM: average number of rooms per dwelling
• AGE: the proportion of owner-occupied units built prior to 1940
• DIS: weighted distances to five Boston employment centers
• RAD: index of accessibility to radial highways
• TAX: full-value property-tax rate per $10,000
• PTRATIO: pupil-teacher ratio by town
• B: 1000(Bk - 0.63)^2 where Bk is the proportion of Black residents by town
• LSTAT: % lower status of the population
• MEDV: median value of owner-occupied homes in $1000’s (this is the target
variable, not one of the 13 features)


You can get the description of the features presented above with the statement:


print(boston_raw.DESCR)


These are the features that we will use to predict the price of a house.
The house values are given by the ‘target’ key in the dataset. The dataset, as we
have seen, is stored as a dictionary. We need to convert it to a data frame in order
to be able to apply the pandas functions and numerical computations necessary to
build our model. To do so, we run the following statements:

boston_dt = pd.DataFrame(boston_raw.data, columns =
                         boston_raw.feature_names)
boston_dt['PRICE'] = boston_raw.target
The first statement converts the data of the Boston housing dataset from a dictionary
entry to a dataframe object. The second statement adds our target variable, which is
the price, as a new column of the created data frame. Now if we explore the first few
lines of the dataset with the head() function, it looks like this:


      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  \
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0

   PTRATIO       B  LSTAT  PRICE
0     15.3  396.90   4.98   24.0
1     17.8  396.90   9.14   21.6
2     17.8  392.83   4.03   34.7
3     18.7  394.63   2.94   33.4
4     18.7  396.90   5.33   36.2


The Boston housing dataset. 
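

The same conversion can be tried on a small hand-made dictionary. The sketch below is only an illustration: the dictionary mimics the 'data', 'feature_names' and 'target' keys described above, with three made-up rows and two features.

```python
import numpy as np
import pandas as pd

# Hypothetical dictionary with the same layout as the loaded dataset
raw = {
    'data': np.array([[6.5, 5.0],
                      [6.4, 9.1],
                      [7.2, 4.0]]),
    'feature_names': ['RM', 'LSTAT'],
    'target': np.array([24.0, 21.6, 34.7]),
}

# One column per feature, named after feature_names
frame = pd.DataFrame(raw['data'], columns=raw['feature_names'])

# The target becomes an additional column
frame['PRICE'] = raw['target']
print(frame)
```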


Now we are going to do some exploratory analysis to better understand the data. We 
check first if there are any missing values: 
boston_dt.isnull().sum() 


Like in the first example, we computed the number of observations with missing
values. This dataset does not contain any missing values:
CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
PRICE      0
dtype: int64


Now let’s explore the range of house prices and their distribution. We are going to plot
a histogram of the house prices to get a visual idea of the prices. We are going to use
the seaborn library to plot the price distribution and the matplotlib library to control
the figure features.
plt.figure(figsize=(5, 4)) # The figure size
# Plot the histogram and the distribution curve
# (distplot is deprecated in recent seaborn versions in favor of histplot)
sns.distplot(boston_dt['PRICE'], color='grey')
plt.xlabel('price ($1000s)') # label of X-axis
plt.ylabel('Frequency') # label of Y-axis
plt.tight_layout()
plt.rc('xtick', labelsize=16) # Font size of the x-axis tick labels (for figures created afterwards)
plt.rc('ytick', labelsize=16) # Font size of the y-axis tick labels (for figures created afterwards)
plt.title('Histogram of House pricing', fontsize=18) # Title of the figure


The figure below provides the histogram and the distribution of the housing prices in
Boston:


Distribution of the housing prices in Boston.


We can also plot the histogram with the matplotlib library with the following function.
plt.hist(boston_dt['PRICE'])


This function, however, does not plot the curve of the distribution.


We can see from the figure above that the prices are approximately normally
distributed, with a median around 20 and most values ranging between 10 and 40 on a
scale of $1000. We have some outliers with values around 50 on the $1000 scale.


Now let’s explore the statistics of the features of the Boston housing prices dataset.
We can get the basic statistics with the following function:
boston_dt.describe()


This function returns a table with the main statistics for each feature as presented in
the table below (values rounded to the nearest integer):


Summary Statistics of variables of the Boston housing dataset.


       CRIM   ZN  INDUS  CHAS  NOX   RM  AGE  DIS  RAD  TAX  PTRATIO    B  LSTAT  PRICE
count   506  506    506   506  506  506  506  506  506  506      506  506    506    506
mean      4   11     11     0    1    6   69    4   10  408       18  357     13     23
std       9   23      7     0    0    1   28    2    9  169        2   91      7      9
min       0    0      0     0    0    4    3    1    1  187       13    0      2      5
25%       0    0      5     0    0    6   45    2    4  279       17  375      7     17
50%       0    0     10     0    1    6   78    3    5  330       19  391     11     21
75%       4   13     18     0    1    7   94    5   24  666       20  396     17     25
max      89  100     28     1    1    9  100   12   24  711       22  397     38     50


The first row gives the names of the feature variables. The second row is the number of
observations for each feature variable. As we have already seen at the beginning of
this example, we have 506 observations in this dataset. The third row is the average
value of each feature. Consistent with the distribution plotted above, the average value
of the prices is 23 in $1000’s. The following statistic is the standard deviation. Some
variables have a high standard deviation, like the variable ‘B,’ which is derived from
the proportion of Black residents by town. A high standard deviation
means that the values of the variable are widely spread around the average. On the
contrary, if the standard deviation is very low, it means that most of the values are
close to the average value.
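

The meaning of the standard deviation can be checked on two tiny arrays. The sketch below uses made-up numbers: both arrays have the same average, but one is much more spread out than the other. It uses ddof=1, which is the sample standard deviation that the pandas describe() function reports.

```python
import numpy as np

# Two hypothetical variables with the same mean but different spread
narrow = np.array([9.0, 10.0, 11.0])
wide = np.array([0.0, 10.0, 20.0])

print(narrow.mean(), wide.mean())   # both averages are 10

# Sample standard deviation (ddof=1), as reported by describe()
std_narrow = narrow.std(ddof=1)     # values stay close to the mean
std_wide = wide.std(ddof=1)         # values are far from the mean

print(std_narrow, std_wide)
```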


We will try to forecast the value of houses by developing a linear regression model. So,
we are going to look for the feature variables that are correlated with the price
variable. We plot the price variable versus each feature variable in a for loop:
feature_name=['ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
              'PTRATIO', 'B', 'LSTAT']
fig = plt.figure(figsize=(10, 12))
for i in range(0, len(feature_name)):
    ax = fig.add_subplot(4, 3, i+1)
    plt.scatter(boston_dt[feature_name[i]],
                boston_dt['PRICE'], color='grey')
    plt.ylabel('Price', size=14, weight='bold')
    plt.xlabel(feature_name[i], size=14, weight='bold')
plt.tight_layout()


The figure below shows the scatter plot of each feature variable and the price variable. 





Scatter plot of the feature variables and the price variable of the Boston housing 
dataset. 


We can see that some variables are strongly correlated with the price, like the ‘RM’
variable, the average number of rooms per dwelling, and the ‘LSTAT’ variable,
the percentage of lower-status population. RM is positively
correlated with the price: the price goes up when the number of rooms increases.
LSTAT is negatively correlated with the price: the price decreases when the % lower
status of the population increases. We can also observe a weak positive
correlation between the price and the variable ‘DIS,’ the weighted distance to five
Boston employment centers. We can also notice that the houses priced at 50 do not
follow the trend of any variable. Let’s confirm these observations by computing the
correlation coefficients with the corrcoef() function of the NumPy library.
Selec_feature=['LSTAT','RM','DIS']
for i in range(0,len(Selec_feature)):
    print('Correlation coefficient between Price and ', Selec_feature[i], '=',
          np.corrcoef(boston_dt[Selec_feature[i]],boston_dt['PRICE'])[1][0])
The statements above return:
Correlation coefficient between Price and LSTAT = -0.73766
Correlation coefficient between Price and RM = 0.6953599
Correlation coefficient between Price and DIS = 0.249928
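

The index [1][0] used with corrcoef() deserves a word of explanation. The function returns a 2 x 2 matrix whose diagonal holds the correlation of each variable with itself (always 1) and whose off-diagonal entries hold the correlation between the two variables. The sketch below illustrates this on made-up arrays that are perfectly linearly related.

```python
import numpy as np

# Hypothetical data: y_up rises with x, y_down falls with x
x = np.array([1.0, 2.0, 3.0, 4.0])
y_up = 2.0 * x + 1.0
y_down = -3.0 * x + 5.0

r_matrix = np.corrcoef(x, y_up)   # 2 x 2 correlation matrix
r_up = r_matrix[1][0]             # off-diagonal entry: correlation of x and y_up
r_down = np.corrcoef(x, y_down)[1][0]

print(r_matrix)
print(r_up, r_down)
```

A perfect increasing linear relation gives a coefficient of 1 and a perfect decreasing one gives -1; real data such as LSTAT or RM fall in between.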


As the variable LSTAT is the variable most correlated with the house prices, we are
going to develop a linear regression model that predicts house prices based on this
feature. In other words, we build a model that explains or predicts the house
price according to LSTAT:
X_boston=boston_dt['LSTAT']
Y_boston=boston_dt['PRICE']
We first convert our data into NumPy arrays with a single column, which is the shape
that sklearn expects:
X_boston=np.array(X_boston).reshape(-1,1)
Y_boston=np.array(Y_boston).reshape(-1,1)
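

The effect of reshape(-1,1) is easy to check on a small array. In the sketch below (made-up values), the -1 tells NumPy to work out the number of rows by itself, and the 1 asks for a single column.

```python
import numpy as np

# Hypothetical one-dimensional array of three values
values = np.array([4.98, 9.14, 4.03])
print(values.shape)               # one dimension with 3 entries

# Turn it into a two-dimensional array with 3 rows and 1 column
column = values.reshape(-1, 1)
print(column.shape)
```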


Now, let’s split the data into two datasets. We will form a test set and a training set as
we did for the logistic regression example:

from sklearn.model_selection import train_test_split
boston_train, boston_test, price_train, price_test = train_test_split(
    X_boston, Y_boston, test_size=0.30, random_state=101)


We verify the size of each subset of data:
print('Size of X_train is', boston_train.shape)
print('Size of Y_train is', price_train.shape)
print('Size of X_test is',boston_test.shape)
print('Size of y_test is',price_test.shape)


The output of these statements is 
Size of X_train is (354, 1) 
Size of Y_train is (354, 1) 
Size of X_test is (152, 1) 
Size of y_test is (152, 1) 


The training set contains 354 observations, and the test set contains 152.
Now we import the LinearRegression class from the sklearn library and fit the
model:
from sklearn.linear_model import LinearRegression
LnReg = LinearRegression()
LnReg.fit(boston_train, price_train)


After running the statements above, the fit function returns:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                 normalize=False)
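

For a single feature, what the fit computes can be written in a few lines. The sketch below is an illustration with NumPy on made-up points, not the implementation used by sklearn: it applies the classical least-squares formulas for the slope and the intercept of the line that best fits the data.

```python
import numpy as np

# Hypothetical data lying roughly on the line y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares slope: covariation of x and y divided by the variation of x
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# The fitted line passes through the point of averages
intercept = y.mean() - slope * x.mean()

print(slope, intercept)
```

With a fitted LinearRegression object, the corresponding values are stored in the attributes coef_ and intercept_.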


To evaluate model performance, the fitted model is applied to estimate house prices 
with the feature LSTAT in the test dataset boston_test: 
y_estimated=LnReg.predict(boston_test) 


Then, we compute the coefficient of determination R² and the Root Mean Square
Error (RMSE) of the estimated data. The closer the RMSE is to 0 and the R² is to 1,
the better the model:
from sklearn.metrics import mean_squared_error
RMSE=(np.sqrt(mean_squared_error(price_test, y_estimated)))
R2=round(LnReg.score(boston_test, price_test),2)


The RMSE of this model is 6.95, and the R² is 0.51.
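

Both measures are simple to compute by hand, which helps to understand what they mean. The sketch below uses four made-up true values and estimates. The RMSE is the square root of the average squared error, and R² is 1 minus the ratio between the squared errors of the model and the squared deviations of the true values from their mean.

```python
import numpy as np

# Hypothetical true values and model estimates
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

residuals = y_true - y_hat

# Root Mean Square Error: typical size of an error, in the unit of y
rmse = np.sqrt(np.mean(residuals ** 2))

# Coefficient of determination: share of the variation of y explained by the model
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(rmse, r2)
```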


Now let’s develop a multi-variable linear model and compare it with the single-variable
model that we have developed. We follow the same steps. We don’t need to import the
libraries again because they are already imported. The features X are all the columns
except the price, and the target Y is the price column only:

Y=boston_dt['PRICE']
X=boston_dt.drop('PRICE',axis=1)
X_train, X_test, y_train, y_test= train_test_split(X, Y,
    test_size=0.30,random_state=101)
LnReg1 = LinearRegression()
LnReg1.fit(X_train, y_train)


We evaluate the model performance to forecast the prices of the training dataset:
y_train_estimated = LnReg1.predict(X_train)
RMSE = (np.sqrt(mean_squared_error(y_train, y_train_estimated)))
R2 = round(LnReg1.score(X_train, y_train),2)
print('RMSE for the training dataset is: ', RMSE)
print('R2 for the training dataset is: ', R2)


The scores on the training dataset are expected to be the best that the model achieves,
because this dataset was used for training the model. Note that Y must contain the
price only. If Y is set to the whole data frame (Y=boston_dt), the model is also scored
on reproducing the 13 feature columns that it receives as input, which gives a
misleadingly low RMSE (1.17 on the training dataset) and an R2 of almost 1. Now we
evaluate the model performance to predict the house prices for the test dataset.


y_test_estimated = LnReg1.predict(X_test)
RMSE = (np.sqrt(mean_squared_error(y_test, y_test_estimated)))
R2 = round(LnReg1.score(X_test, y_test),2)
print('RMSE for the test dataset is: ', RMSE)
print('R2 for the test dataset is: ', R2)


As with the training dataset, an RMSE as low as 1.42 with an R2 of almost 1 on the
test dataset is the signature of a target that still contains the feature columns. With the
price as the only target, compare the test RMSE and R2 with those of the
single-variable model (6.95 and 0.51): a lower RMSE and a higher R2 mean that
including more features in the model increased its accuracy. The scatter plot of
the actual prices versus the estimated prices helps to confirm the accuracy of the model.

fig = plt.figure(figsize=(5, 4))
plt.scatter(y_test, y_test_estimated,color='grey')
plt.ylabel('Estimated price ($1000)', size=14,weight='bold')
plt.xlabel('True price ($1000)', size=14,weight='bold')
plt.tight_layout()


Scatter plot of predicted prices and the actual prices.


Example 3: Perceptron 


This example is an introduction to developing a neural network with
Python. We will develop a perceptron, a single-layer neural network, to
classify the Iris data. You will learn in this example how to explore the
dataset and create figures. You will learn how to develop, train, and
evaluate the accuracy of a perceptron using the built-in functions of Python.
The Iris data used in this example as well as in the following example is a
multivariate dataset widely used in machine learning, typically to
illustrate how to build a classifier. The Iris dataset is composed
of samples of 3 species of the Iris flower. The features available
to discriminate between the species are the length and the width of both the
sepals and the petals. Based on the combination of these features, we
are going to build a perceptron model to discriminate between the
species. This dataset is freely available within the sklearn library.


First, we import Python libraries and load the data from the sklearn library: 
import numpy as np 
import pandas as pd 
#Visualization Libraries 
import seaborn as sns 
import matplotlib.pyplot as plt 
from sklearn import datasets 


Iris = datasets.load_iris() 


Next, we check the type of the data:
print(type(Iris)) 


The function returns:
<class 'sklearn.utils.Bunch'>
We can see that the data is stored as a Bunch, a dictionary-like object. Now
we inspect the keys, the dimension, and the feature names of the dataset:

print('Keys of the Iris dataset stored as dictionary are:')
print(Iris.keys())
print('The dimension of the Iris dataset is')
print(Iris.data.shape)
print('The name of features in the Iris dataset are')
print(Iris.feature_names)


The statements above return:
Keys of the Iris dataset stored as dictionary are:
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names',
'filename'])
The dimension of the Iris dataset is
(150, 4)
The name of features in the Iris dataset are
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
'petal width (cm)']


The four variables or features of the Iris flower are stored as follows:

sepal length (cm)   sepal width (cm)   petal length (cm)   petal width (cm)   Species
5.1                 3.5                1.4                 0.2                0
4.9                 3                  1.4                 0.2                0
4.7                 3.2                1.3                 0.2                0
4.6                 3.1                1.5                 0.2                0
5                   3.6                1.4                 0.2                0


In total, we have 150 samples of Iris flowers, 50 of each species. The four
features, namely the sepal length and width and the petal length and width,
are measured in centimeters. The species (i.e., the target variable) is
already encoded as an integer with values of 0, 1, and 2, where each index is
associated with a specific species: Iris setosa is indexed with 0, Iris
versicolor with 1, and Iris virginica with 2. Now let’s get the summary
statistics of each feature. Because Iris.data is a NumPy array, we first wrap
it in a pandas DataFrame:
pd.DataFrame(Iris.data, columns=Iris.feature_names).describe()


The summary statistics are:

        sepal length (cm)   sepal width (cm)   petal length (cm)   petal width (cm)
count   150                 150                150                 150
mean    5.843               3.057              3.758               1.199
std     0.828               0.436              1.765               0.762
min     4.3                 2                  1                   0.1
25%     5.1                 2.8                1.6                 0.3
50%     5.8                 3                  4.35                1.3
75%     6.4                 3.3                5.1                 1.8
max     7.9                 4.4                6.9                 2.5


We retrieve the feature data and the target variable: 
X_Iris=Iris.data 


Y_Iris=Iris.target 


Let’s explore properties of each Iris species:
# x axis: petal length (column 2); y axis: sepal length (column 0)
plt.scatter(X_Iris[Y_Iris==0, 2], X_Iris[Y_Iris==0, 0],
            color='k', marker='o', label='setosa')
plt.scatter(X_Iris[Y_Iris==1, 2], X_Iris[Y_Iris==1, 0],
            color='k', marker='*', label='versicolor')
plt.scatter(X_Iris[Y_Iris==2, 2], X_Iris[Y_Iris==2, 0],
            color='k', marker='+', label='virginica')
plt.xlabel('petal length (cm)')
plt.ylabel('sepal length (cm)')
plt.legend(loc='best')
plt.show()


The figure below presents the sepal length versus the petal length for the
three Iris species setosa, versicolor, and virginica. We can already
distinguish the setosa species from the other two species according to the
sepal and petal lengths.

Scatter plot of the sepal and petal length of the Iris species.


From the scatter plot above of the three species, we can see that the
versicolor and virginica species overlap and are hard to tell apart on these
features, whereas setosa is clearly separated. So, we merge these two species
into one single class as follows:

Y_Iris= (Iris.target==0).astype(np.int8) 


print(Y_Iris) 
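We simply assigned the index 1 to the setosa Iris and 0 to the others. The
target variable now is:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]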


We divide the data into two datasets: one will be used for training the model,
and the second for assessing the accuracy of the model:
from sklearn.model_selection import train_test_split
Iris_train, Iris_test, Species_train, Species_test = train_test_split(
    X_Iris, Y_Iris, test_size=0.3)
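
Conceptually, splitting the data amounts to shuffling the sample indices and
cutting them into two groups. A minimal NumPy sketch of that idea follows. The
helper split_train_test is hypothetical; it is not sklearn's implementation
and it ignores options such as stratification. The data is a stand-in with the
same shape as the Iris features.

```python
import numpy as np

def split_train_test(X, y, test_size=0.3, seed=0):
    """Shuffle the row indices, then hold out a fraction of rows for testing."""
    X = np.asarray(X)
    y = np.asarray(y)
    n_samples = X.shape[0]
    n_test = int(round(test_size * n_samples))
    rng = np.random.default_rng(seed)        # seeded so the split is repeatable
    shuffled = rng.permutation(n_samples)
    test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Stand-in data: 150 samples with 4 features, label i belongs to row i
X_demo = np.arange(600, dtype=float).reshape(150, 4)
y_demo = np.arange(150)
X_tr, X_te, y_tr, y_te = split_train_test(X_demo, y_demo, test_size=0.3)
print(X_tr.shape, X_te.shape)                # (105, 4) (45, 4)
```

The same index array is applied to the features and to the labels, so each
sample keeps its own label after shuffling.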


In this example, we are going to use the sklearn library to build the 
perceptron: 


from sklearn.linear_model import Perceptron 


We make a perceptron object with a learning rate of 0.1 and a maximum
of 100 iterations:
Perc_model = Perceptron(max_iter=100, tol=0.19, eta0=0.1, random_state=0)


Then we fit the perceptron with the Iris_train dataset: 


Perc_model.fit(Iris_train, Species_train) 


The training statement returns:
Perceptron(alpha=0.0001, class_weight=None, early_stopping=False,
eta0=0.1, fit_intercept=True, max_iter=100, n_iter=None,
n_iter_no_change=5, n_jobs=None, penalty=None, random_state=0,
shuffle=True, tol=0.19, validation_fraction=0.1, verbose=0,
warm_start=False)


The perceptron object has as attributes the coefficients (W), the intercept
(b), and the number of iterations needed to reach the stopping criterion. We
can access these through the attributes of the Perceptron object:
print('Perceptron weights: ', Perc_model.coef_)
print('Perceptron bias:', Perc_model.intercept_)
print('Number of itr to stop criteria:', Perc_model.n_iter_)


The attributes of the Perceptron developed in this example are:
Perceptron weights: [[ 0.09 0.44 -0.67 -0.3] [ 1.51 -2.16 1.28 -2.07]
[-2.66 -1.72 3.33 3.35]]
Perceptron bias: [ 0.1 0.7 -1.1]
Number of itr to stop criteria: 8


Now we evaluate the performance of the perceptron to classify the Iris
species of the test dataset.
y_estimated = Perc_model.predict(Iris_test)
from sklearn.metrics import accuracy_score
print('Accuracy of the model : %.2f' % accuracy_score(Species_test,
    y_estimated))

The accuracy of the perceptron is 0.98. This means that the model classifies
98% of the test data accurately.
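
To see what happens when a perceptron is fitted, the classic perceptron
learning rule can be written in a few lines of NumPy. This is a sketch under
stated assumptions, not sklearn's implementation: the functions
train_perceptron and predict are hypothetical names, and the six samples are
made-up values that loosely resemble petal length and petal width.

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, max_iter=100):
    """Perceptron rule for labels in {0, 1}: adjust w and b only on mistakes."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_iter):
        errors = 0
        for xi, target in zip(X, y):
            pred = 1 if np.dot(w, xi) + b > 0 else 0
            update = eta * (target - pred)    # 0 when the prediction is right
            if update != 0.0:
                w += update * xi
                b += update
                errors += 1
        if errors == 0:                       # a full pass without mistakes
            break
    return w, b

def predict(X, w, b):
    """Class 1 when the weighted sum plus bias is positive, else class 0."""
    return (X @ w + b > 0).astype(int)

# Made-up, linearly separable samples (label 1 plays the role of setosa)
X_toy = np.array([[1.4, 0.2], [1.3, 0.2], [1.5, 0.3],
                  [4.5, 1.5], [5.0, 1.7], [4.7, 1.4]])
y_toy = np.array([1, 1, 1, 0, 0, 0])

w, b = train_perceptron(X_toy, y_toy)
print('Weights:', w, 'Bias:', b)
print('Training accuracy:', np.mean(predict(X_toy, w, b) == y_toy))
```

On data that a straight line can separate, the rule stops making mistakes
after a finite number of passes, which is why the merged setosa-versus-rest
problem suits a perceptron.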


Example 4: Multi-Layer Neural Network 


The aim of this example is to guide you through the steps to develop a
multi-layer neural network. In this application, you will learn how to
develop an artificial neural network from scratch. In other words, we will
not use any machine learning library to develop or train the artificial
neural network. Instead, we will develop several functions, each of which
models one component of the neural network: the initialization of the
weights, forward propagation, the cost function, back propagation, and
finally the function that updates the weights. Together, these functions
implement the gradient descent algorithm used to fit the neural network. We
will use the same dataset as in the previous example, the Iris dataset.
Finally, in this example, we will present how to develop and train an
artificial neural network using the built-in functions of the sklearn
library. You will learn how to select a specific optimization algorithm to
train the neural network, the activation function, the size of the hidden
layers, and the parameter that sets the learning rate. We will train the
multi-layer neural network with two different algorithms and compare their
accuracy.
First, we have to load our packages and the data from the sklearn library:

import numpy as np
import pandas as pd
from sklearn import datasets
Iris = datasets.load_iris()
X=Iris.data
We noticed in the previous example, where we developed a Perceptron as a
classifier for the Iris data, that the Iris versicolor and virginica are very
similar and hard to distinguish. We merge these two species into a single
class, as we did before, by assigning the index 1 to the Iris setosa flowers
and 0 otherwise:


Y= (Iris.target==0).astype(np.int8) 


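To develop the multi-layer neural network from scratch, we are going first
to write a function that returns the sizes of the network layers. The
function expects X with one sample per column, i.e., with the shape (number
of features, number of samples), and Y as a row vector:
def Get_sizes(X, Y):
    s_x = X.shape[0]  # Input layer size (number of features)
    s_h = 7           # Hidden layer size
    s_y = Y.shape[0]  # Output layer size
    return (s_x, s_h, s_y)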
Next, we write a function that initializes the neural network
parameters:
def Init(s_X, s_h, s_Y):
    W1 = np.random.randn(s_h, s_X) * 0.01  # weight matrix of shape (s_h, s_x)
    b1 = np.zeros(shape=(s_h, 1))          # bias vector of shape (s_h, 1)
    W2 = np.random.randn(s_Y, s_h) * 0.01  # weight matrix of shape (s_y, s_h)
    b2 = np.zeros(shape=(s_Y, 1))          # bias vector of shape (s_y, 1)
    Param = {"W1": W1,
             "b1": b1,
             "W2": W2,
             "b2": b2}
    return Param


The above procedure returns the initial weight matrices W1 and W2 and the
initial bias vectors b1 and b2 according to the sizes of the input, hidden,
and output layers.


Next, we define the activation functions we will use in the multi-layer
network, namely the sigmoid and ReLu functions.
def Sigmoid(Z):
    return 1/(1+np.exp(-Z))

def Relu(Z):
    return np.maximum(0,Z)


Now, we are going to define a function that performs the forward
propagation in the neural network. This function basically multiplies the
input data X by the weights of the first layer W1 and adds the bias vector
b1. Then, it applies the ReLu activation, which provides the output of the
first layer A1. Next, it multiplies A1 by the weight matrix of the second
layer W2 and adds the bias vector b2. The Sigmoid activation is then applied
to obtain A2, which is the output of the network. This function returns the
weighted sums Z1 and Z2 of each layer as well as the values of the activation
functions A1 and A2.


def fd_Propagation(Z,Param):
    Z1 = np.dot(Param['W1'],Z)+Param['b1']
    A1 = Relu(Z1)
    Z2 = np.dot(Param['W2'],A1)+Param['b2']
    A2 = Sigmoid(Z2)
    ch = {'Z1': Z1,
          'A1': A1,
          'Z2': Z2,
          'A2': A2}
    return ch
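
It can help to trace the shapes of the arrays through one forward pass. The
sketch below repeats the same computation on random stand-in data with 4
features, 7 hidden nodes, and 5 samples stored one per column. The variable
names and the sample count are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden, n_samples = 4, 7, 5

X_demo = rng.normal(size=(n_features, n_samples))    # one sample per column
W1 = rng.normal(size=(n_hidden, n_features)) * 0.01
b1 = np.zeros((n_hidden, 1))
W2 = rng.normal(size=(1, n_hidden)) * 0.01
b2 = np.zeros((1, 1))

Z1 = W1 @ X_demo + b1           # (7, 4) @ (4, 5); b1 broadcasts over columns
A1 = np.maximum(0, Z1)          # ReLu keeps the shape (7, 5)
Z2 = W2 @ A1 + b2               # (1, 7) @ (7, 5) gives (1, 5)
A2 = 1 / (1 + np.exp(-Z2))      # one probability per sample

for name, arr in [('X', X_demo), ('Z1', Z1), ('A1', A1), ('Z2', Z2), ('A2', A2)]:
    print(name, arr.shape)
```

Each column of A2 is the estimated probability that the corresponding sample
belongs to class 1, so the output has as many columns as there are samples.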


After the forward propagation, we need to evaluate the accuracy of the
estimated output. To do so, we write a function that computes the cost
function.
def cost_fct(A2,Y):
    m = Y.size  # number of samples
    tmp = np.multiply(np.log(A2), Y) + np.multiply((1 - Y), np.log(1 - A2))
    cost = - np.sum(tmp) / m
    return cost

The function cost_fct() defined above returns the neural network error, here
the binary cross-entropy averaged over the samples. The function takes the
estimated output and the target value as inputs.
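
A small worked example shows how this cost behaves. The sketch below uses a
hypothetical helper, binary_cross_entropy, that computes the same quantity
and additionally clips the estimates away from 0 and 1 so that the logarithm
stays finite. The clipping and the sample values are assumptions added for
the illustration.

```python
import numpy as np

def binary_cross_entropy(A, Y, eps=1e-12):
    """Mean of -[Y*log(A) + (1-Y)*log(1-A)], with A clipped to avoid log(0)."""
    A = np.clip(np.asarray(A, dtype=float), eps, 1 - eps)
    Y = np.asarray(Y, dtype=float)
    return float(-np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A)))

Y_demo = np.array([1, 0, 1, 0])
confident_right = np.array([0.9, 0.1, 0.8, 0.2])   # close to the targets
undecided = np.array([0.5, 0.5, 0.5, 0.5])         # no information
confident_wrong = np.array([0.1, 0.9, 0.2, 0.8])   # far from the targets

for name, A in [('confident and right', confident_right),
                ('undecided', undecided),
                ('confident and wrong', confident_wrong)]:
    print(name, round(binary_cross_entropy(A, Y_demo), 4))
```

The cost is small when the estimates are close to the targets, equals log(2)
when every estimate is 0.5, and grows quickly for confident mistakes.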


In order to fit the multi-layer network, we need to develop a gradient
descent algorithm that optimizes the artificial neural network. In other
words, we must estimate the combination of weights that minimizes the cost
function. To do so, we need to compute the gradients of the activation
functions. For that, we write two functions that provide the gradients of
the Relu and Sigmoid functions:
def dRelu(Z):
    # 1 where Z > 0 and 0 elsewhere; a new array is returned so Z is unchanged
    return (Z > 0).astype(float)

def dSigmoid(Z):
    c = 1/(1+np.exp(-Z))
    dZ = c * (1-c)
    return dZ
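
A quick way to gain confidence in a hand-written derivative is to compare it
with a finite-difference approximation. The sketch below does this for the
sigmoid. The function names, the test points, and the step size h are choices
made for the illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_grad(z):
    """Analytical derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1 - s)

z = np.linspace(-3, 3, 7)
h = 1e-5
# Central difference: (f(z + h) - f(z - h)) / (2h)
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
max_err = np.max(np.abs(numeric - sigmoid_grad(z)))
print('Largest difference:', max_err)
```

The two results agree to many decimal places. The same check can be applied
to any other activation function at points where it is differentiable.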


We also need to compute the gradients of the weight matrices. This
process is called backward propagation. We write the following function,
which we call bd_Propagation:
def bd_Propagation(Param, ch, E, F):
    # E: input data of shape (features, samples); F: targets of shape (1, samples)
    m = E.shape[1]
    W1 = Param['W1']
    W2 = Param['W2']
    Z1 = ch['Z1']
    A1 = ch['A1']
    Z2 = ch['Z2']
    A2 = ch['A2']
    # Gradient of the cost with respect to the network output A2
    dF = - (np.divide(F, A2) - np.divide(1 - F, 1 - A2))
    # Output layer
    dZ2 = dF * dSigmoid(Z2)
    dA1 = np.dot(W2.T, dZ2)
    dW2 = 1./m * np.dot(dZ2, A1.T)
    db2 = 1./m * np.dot(dZ2, np.ones([dZ2.shape[1], 1]))
    # Hidden layer
    dZ1 = dA1 * dRelu(Z1)
    dW1 = 1./m * np.dot(dZ1, E.T)
    db1 = 1./m * np.dot(dZ1, np.ones([dZ1.shape[1], 1]))
    gradients = {'dW1': dW1,
                 'db1': db1,
                 'dW2': dW2,
                 'db2': db2}
    return gradients
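
For a sigmoid output combined with the cross-entropy cost, the product of the
cost gradient and the sigmoid gradient simplifies algebraically to the
difference between the estimated output and the target. The sketch below
checks this numerically on random stand-in values. The variable names are
assumptions chosen for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
Z2 = rng.normal(size=(1, 6))                      # stand-in weighted sums
Y = np.array([[1, 0, 1, 1, 0, 0]], dtype=float)   # stand-in targets
A2 = 1 / (1 + np.exp(-Z2))                        # sigmoid output

# Chain rule, step by step
dA2 = -(Y / A2 - (1 - Y) / (1 - A2))              # d(cost)/d(A2)
dZ2_chain = dA2 * A2 * (1 - A2)                   # times d(A2)/d(Z2)

# Closed form after simplification
dZ2_short = A2 - Y

print(np.allclose(dZ2_chain, dZ2_short))          # True
```

Using the short form avoids dividing by values of A2 that are very close to 0
or 1, which makes the computation numerically safer.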


The bd_Propagation function takes as input the weights, the output of the
forward propagation function (i.e., fd_Propagation), and the input data and
target output to compute the gradients. Then, we write the function Update()
that updates the parameters:
def Update(Param, gradients, alpha):
    W1 = Param['W1']-alpha*gradients['dW1']
    W2 = Param['W2']-alpha*gradients['dW2']
    b1 = Param['b1']-alpha*gradients['db1']
    b2 = Param['b2']-alpha*gradients['db2']
    Param = {"W1": W1,
             "b1": b1,
             "W2": W2,
             "b2": b2}
    return Param


The Update() function updates the weight matrices and bias vectors by
subtracting the product of the learning rate and the corresponding gradient.
The learning rate (called alpha) is a parameter that controls the speed at
which the network learns: the weights are updated by a proportion controlled
by the learning rate. Now that we have developed all the functions necessary
for the forward propagation, the cost function, and back propagation, let’s
assemble the gradient descent algorithm in order to fit the neural network.
The gradient descent algorithm performs the following tasks in sequence:
initialize the parameters, perform the forward propagation, compute the cost
function, perform back propagation to compute the gradients, update the
parameters, and repeat the process from the forward propagation until it
attains the maximum number of iterations.
The procedure is as follows:
def Gradient_descent(X, Y, nbr_itr=500, print_cost_fct=False, alpha=0.01):
    # alpha is the learning rate; its default value here is an assumed choice
    np.random.seed()
    s_X, s_h, s_Y = Get_sizes(X, Y)
    Param = Init(s_X, s_h, s_Y)
    for k in range(0, nbr_itr):
        ch = fd_Propagation(X, Param)
        A2 = ch['A2']
        cost = cost_fct(A2, Y)
        gradients = bd_Propagation(Param, ch, X, Y)
        Param = Update(Param, gradients, alpha)
        if print_cost_fct and k % 500 == 0:
            print("Cost after iteration %i: %f" % (k, cost))
    return Param, s_h


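Now we can run the model with the Iris dataset. The functions above expect
one sample per column, so we pass the transposed feature matrix and the
target as a row vector:
Gradient_descent(X.T, Y.reshape(1, -1), nbr_itr=500, print_cost_fct=False)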
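
The role of the learning rate is easier to see on a much simpler problem than
a neural network. The sketch below applies the same update rule to the
one-variable function f(w) = (w - 3)^2, whose minimum is at w = 3. The
function gradient_descent_1d and the three learning rates are hypothetical
choices for the illustration.

```python
def gradient_descent_1d(alpha, n_iter=50, w0=0.0):
    """Minimize f(w) = (w - 3)**2 with a fixed learning rate alpha."""
    w = w0
    for _ in range(n_iter):
        grad = 2.0 * (w - 3.0)     # derivative of (w - 3)**2
        w = w - alpha * grad       # same form of update as for the weights
    return w

for alpha in (0.01, 0.1, 1.1):
    print('alpha =', alpha, '-> w after 50 iterations:',
          gradient_descent_1d(alpha))
```

With a small learning rate the estimate is still far from 3 after 50
iterations, with a moderate one it is very close to 3, and with a learning
rate that is too large each step overshoots the minimum and the estimate
diverges.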


Instead of developing a multi-layer neural network from scratch, the sklearn
library can be used.
import numpy as np
import pandas as pd
from sklearn import datasets
Iris = datasets.load_iris()
X=Iris.data


The Iris versicolor and virginica are very similar and hard to distinguish.
We merge these two species into a single class, as we did before, by
assigning the index 1 to the Iris setosa flowers and 0 otherwise:


Y= (Iris.target==0).astype(np.int8) 


We divide the data into two sets for fitting and for evaluating the accuracy
of the neural network:
from sklearn.model_selection import train_test_split 
Iris_train, Iris_test, Species_train, Species_test = train_test_split(X, 
Y, test_size = 0.30) 


We create a neural network as a multi-layer classifier using the function
MLPClassifier(). We specify the sizes of the hidden layers, i.e., the number
of nodes of each layer and thereby how many hidden layers the network has, as
well as the maximum number of iterations for the optimization algorithm:
from sklearn.neural_network import MLPClassifier
ANN = MLPClassifier(hidden_layer_sizes=(10, 10, 10), max_iter=1000)
Here we created a multi-layer classifier with three hidden layers of 10 nodes
each and a maximum iteration number of 1000.
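
The choice of hidden layer sizes determines how many weights and biases the
network has to learn. The sketch below counts them for a fully connected
network. The helper count_parameters is hypothetical, and the calculation
assumes 4 input features (the Iris features) and a single output node for the
two-class problem.

```python
def count_parameters(n_inputs, hidden_layer_sizes, n_outputs):
    """Count weights and biases of a fully connected feed-forward network."""
    sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    total = 0
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        total += fan_in * fan_out   # one weight per connection
        total += fan_out            # one bias per node in the next layer
    return total

print(count_parameters(4, (10, 10, 10), 1))   # 281
print(count_parameters(4, (10,), 1))          # 61
```

Under these assumptions, three hidden layers of 10 nodes give 281 parameters,
against 61 for a single hidden layer of 10 nodes. Both are large numbers
relative to a dataset of 150 samples.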


We can then train the artificial neural network and make predictions:
ANN.fit(Iris_train, Species_train)
The training function returns:
MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto',
beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08,
hidden_layer_sizes=(10, 10, 10), learning_rate='constant',
learning_rate_init=0.001, max_iter=1000, momentum=0.9,
n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
random_state=None, shuffle=True, solver='adam', tol=0.0001,
validation_fraction=0.1, verbose=False, warm_start=False)


We can see from the output of the fit function that, by default, the
activation function used in the neural network we developed is the ReLu
function and the initial learning rate is 0.001 (i.e., the
learning_rate_init parameter). The alpha parameter, 0.0001, is the strength
of the L2 regularization term, not the learning rate. The optimizer used by
default is 'adam'. The adam algorithm is an optimization algorithm developed
for training neural networks that adapts the step size of each weight during
optimization. We can change the solver algorithm when we create the
multi-layer classifier as follows:

ANN_sgd = MLPClassifier(hidden_layer_sizes=10, solver='sgd',
                        learning_rate_init=0.01, max_iter=500)

Here, we created the multi-layer classifier object with a single hidden layer
of 10 nodes, the stochastic gradient descent (sgd) algorithm, and an initial
learning rate equal to 0.01.
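
To give an idea of how the adam solver differs from plain gradient descent,
the sketch below writes out the Adam update rule as it is commonly described
and applies it to a simple quadratic function. It is an illustration under
stated assumptions, not sklearn's internal code: the function adam_minimize
is hypothetical, and the values of beta_1, beta_2, and epsilon are taken from
the defaults printed above.

```python
import numpy as np

def adam_minimize(grad_fn, w0, lr=0.001, beta_1=0.9, beta_2=0.999,
                  epsilon=1e-8, n_iter=5000):
    """Adam: per-weight steps scaled by running averages of the gradient."""
    w = np.asarray(w0, dtype=float).copy()
    m = np.zeros_like(w)    # running average of the gradients
    v = np.zeros_like(w)    # running average of the squared gradients
    for t in range(1, n_iter + 1):
        g = grad_fn(w)
        m = beta_1 * m + (1 - beta_1) * g
        v = beta_2 * v + (1 - beta_2) * g ** 2
        m_hat = m / (1 - beta_1 ** t)     # correct the bias toward zero
        v_hat = v / (1 - beta_2 ** t)
        w -= lr * m_hat / (np.sqrt(v_hat) + epsilon)
    return w

# Gradient of f(w) = sum((w - target)**2); the target is a made-up value
target = np.array([3.0, -1.0])
grad = lambda w: 2 * (w - target)

w_final = adam_minimize(grad, w0=[0.0, 0.0], lr=0.01)
print(w_final)
```

Plain gradient descent multiplies every gradient by the same learning rate,
whereas Adam divides each step by a running estimate of the gradient
magnitude, so each weight effectively gets its own step size.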


Fitting this multi-layer neural network:
ANN_sgd.fit(Iris_train, Species_train)
returns:
MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
beta_2=0.999, early_stopping=False, epsilon=1e-08,
hidden_layer_sizes=10, learning_rate='constant',
learning_rate_init=0.01, max_iter=500, momentum=0.9,
n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
random_state=None, shuffle=True, solver='sgd', tol=0.0001,
validation_fraction=0.1, verbose=False, warm_start=False)


Now let’s predict the Iris flower species in the test dataset using both
multi-layer neural networks, fitted with the adam and the stochastic gradient
descent algorithms:
Y_adam = ANN.predict(Iris_test)
Y_sgd = ANN_sgd.predict(Iris_test)

The predicted output with both classifiers:
print('Predicted species with adam algo:', Y_adam)
print('Predicted species with sgd algo:', Y_sgd)
is:
Predicted species with adam algo: [1 1 0 0 0 0 0 0 0 1 1 0 0 0 1 1 0 0 0 0 0 0 0
 1 0 1 0 0 0 1 1 0 1 0 0 0 0 0 0 1 0 1 0 1 1]
Predicted species with sgd algo: [1 1 0 0 0 0 0 0 0 1 1 0 0 0 1 1 0 0 0 0 0 0 0
 1 0 1 0 0 0 1 1 0 1 0 0 0 0 0 0 1 0 1 0 1 1]


Finally, we evaluate the accuracy of the artificial neural network trained
with the adam optimization algorithm using the confusion matrix that we
defined in example 2 and the accuracy score:
from sklearn.metrics import confusion_matrix
print('Confusion matrix is', confusion_matrix(Species_test, Y_adam))
from sklearn.metrics import accuracy_score
print('Accuracy of the model: %.2f' % accuracy_score(Species_test, Y_adam))


After running these statements, they return: 
Confusion matrix is [[30 0] [ 0 15]] 
Accuracy of the model: 1.00 


Now let’s compare this confusion matrix with the confusion matrix of the
model optimized with the stochastic gradient descent algorithm and evaluate
how accurate the artificial neural network is with that algorithm:
from sklearn.metrics import confusion_matrix
print('Confusion matrix is', confusion_matrix(Species_test, Y_sgd))
from sklearn.metrics import accuracy_score
print('Accuracy of the model: %.2f' % accuracy_score(Species_test, Y_sgd))


Running these statements provides the following output:
Confusion matrix is [[30 0] [ 0 15]]
Accuracy of the model: 1.00
We note here that the results are the same regardless of the solver or the
optimization algorithm used. The model is 100% accurate on the test data. We
obtained similar accuracy with the perceptron in the previous example. The
accuracy obtained in these examples is good given the small size of the
dataset (i.e., only 150 Iris samples). This example raises the issue of
overfitting artificial neural networks. We used 70% of the dataset to train
the model and 30% to test the ability of the model to make a good
prediction. Let’s change the size of the training and test datasets and see
if it affects the model accuracy. For example, we divide the Iris data in
half:
from sklearn.model_selection import train_test_split 

Iris_train, Iris_test, Species_train, Species_test = train_test_split(X, 
Y, test_size = 0.50) 


Then we follow the same steps. Because the adam algorithm and the stochastic
gradient descent algorithm yield similar results, we will create a
multi-layer neural network that uses only the stochastic gradient descent
algorithm as a solver for training the model:
ANN_sgd = MLPClassifier(hidden_layer_sizes=10, solver='sgd',
                        learning_rate_init=0.01, max_iter=500)


Now, we train and predict using the new model:
ANN_sgd.fit(Iris_train, Species_train)
Y_sgd = ANN_sgd.predict(Iris_test)
Finally, we evaluate the confusion matrix and the model accuracy:
from sklearn.metrics import confusion_matrix
print('Confusion matrix is', confusion_matrix(Species_test, Y_sgd))
from sklearn.metrics import accuracy_score
print('Accuracy of the model: %.2f' % accuracy_score(Species_test, Y_sgd))


The statements above return: 
Confusion matrix is [[47 0] [ 0 28]] 
Accuracy of the model: 1.00 


The neural network newly trained on half of the data has accuracy and
precision similar to those of the neural network trained on 70% of the data.
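
The confusion matrices and accuracy scores used in this example can also be
computed by hand, which makes their meaning explicit. The sketch below uses a
hypothetical helper, confusion_counts, and made-up labels (not the Iris
predictions). Rows correspond to the true class and columns to the predicted
class, the same convention as in sklearn.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes=2):
    """Entry [i, j] counts the samples of true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Made-up labels for illustration
y_true_demo = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_pred_demo = np.array([0, 0, 1, 1, 1, 0, 0, 0])

cm = confusion_counts(y_true_demo, y_pred_demo)
accuracy = np.trace(cm) / cm.sum()    # correct predictions lie on the diagonal
print(cm)                             # [[4 1]
                                      #  [1 2]]
print('Accuracy: %.2f' % accuracy)    # Accuracy: 0.75
```

A matrix with zeros off the diagonal, as obtained above for the Iris test
data, means that no sample was assigned to the wrong class.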


Conclusion 


Thank you for making it to the end of Machine Learning With Python. Let
us hope that it was informative and able to provide you with all of the tools
you need to achieve your goals, whatever they may be.


The objective of this book is to present an introduction to machine learning
and data science for absolute beginners. The book presents the reasoning
behind machine learning and its methods as well as a guide for using Python
to apply those methods.


This book covers the major machine learning paradigms, namely
supervised, unsupervised, semi-supervised, and reinforcement learning. The
book also covers artificial neural network principles in a separate chapter.
This book explains how to develop machine learning models in general and how
to develop a neural network, which is a particular method of performing
machine learning. It teaches how to train these models and evaluate their
accuracy.


Python is a widely used programming language for different applications
and in particular for machine learning. This book covers basic Python
programming as well as a guide to using Python libraries for machine
learning.


This book presents machine learning applications using real datasets to help
you enhance your Python programming skills as well as the machine learning
basics acquired through the book. These applications provide examples of
developing a machine learning model for predictions using linear regression,
and a classifier using logistic regression and an artificial neural network.
Through these applications, examples of data exploration and visualization
using Python are presented.


Machine learning, and artificial neural networks in particular, is an active
research subject. Nowadays, machine learning is used in many domains, such as
marketing, health care systems, banking systems, the stock market, and gaming
applications, among others. This book’s objective is to provide a basic
understanding of the major branches of machine learning as well as the
philosophy behind artificial neural networks. The book also aims at
providing Python programming skills for machine learning to beginners
with no previous programming skills in Python or any other programming
language.


Once you have acquired the skills and understood the reasoning behind the
machine learning models presented in this book, you will be able to use
these skills to solve complex problems using machine learning. You will
also be able to easily acquire other skills and use more advanced machine
learning methods.


Introduction





Congratulations on downloading Deep Learning with Python, and thank
you for doing so!


The following chapters discuss what you need to learn to get started with
deep learning. Machine learning has been taking off in the technology world,
helping to create programs and to get computers to think and react in ways
that may have seemed impossible in the past. Deep learning takes this idea a
bit further and teaches the computer to learn through repetition of the same
action, in a manner similar to how humans learn.


This guidebook explores what you can do with deep learning algorithms, and
it spends a lot of time on each of the three main Python libraries for deep
learning and what you can do with them to make deep learning work. But
first, we explore a bit more about deep learning and how it works, the
importance of neural networks in this process, and some of the basics of
Python so that we are ready to start coding in no time.


From there, we get an introduction to the three main Python libraries that
help us explore deep learning. We will get a sense of TensorFlow, Keras, and
PyTorch: what they are about, when you would use each one, and how to install
them on your computer. Before we move any deeper into these algorithms and
what we can do with them, it is recommended that you install all three
libraries. They often work in an interconnected manner, and having them
installed makes creating your own deep learning models that much easier.


The rest of the guidebook focuses on what you can do with each of these
three libraries to get the most out of deep learning. We start with some of
the basics of TensorFlow before moving on to using it in deep learning.
Then, we explore some of the unique things that you can do with the Keras
and PyTorch libraries. While many people focus on TensorFlow when it comes
to the basics of machine learning, Keras and PyTorch are two libraries that
can enhance your machine learning models, which is why we spend some time
learning about them in this guidebook as well.


There is so much to enjoy when it comes to working with deep learning and
using the Python language to make the process easier. When you are ready
to learn some of the algorithms and models that you can create with the
help of the Python libraries, make sure to check out this guidebook to help
you get started!


There are plenty of books on this subject on the market, so thanks again for
choosing this one! Every effort was made to ensure it is full of as much
useful information as possible. Please enjoy!


Chapter 1: What Is Deep Learning? 


Before we look at the different things that you can do with deep learning,
we need to spend some time understanding what deep learning is and why it is
important to learn about. Deep learning is a type of machine learning. In
particular, it concerns itself with teaching computers to do what comes
naturally to humans, such as learning by example. Thanks to the work that
has been done with deep learning, projects like the driverless car are now
possible because a car is able to distinguish between pedestrians and
lampposts and can even recognize when it is near a stop sign.


Deep learning opens up a lot of possibilities for us. In the past, if we
wanted a computer or another piece of technology to do something, we had to
write code that spelled out exactly how to do it. This works for a lot of
different projects and processes, but it may not be the full solution that we
are looking for. Deep learning, along with some of the other parts of
machine learning, takes this a bit further and lets the computer or machine
learn on its own, without explicit code telling it what to do.


Deep learning is inspired by the structure of the human brain and how it
operates. With deep learning, machines can perform tasks that are often
thought of as needing human intelligence. The machines take the experiences
that they have, and from them they can acquire skills without the
intervention of humans. Deep learning involves a few different processes,
but in particular it involves the use of artificial neural networks to learn
patterns, trends, and relations from the data sets that it receives.


Humans learn any time that they do a task on repeat. In a similar manner,
the algorithms for deep learning perform a task repeatedly while making
minor tweaks in order to improve the outcome. With deep learning, the models
on the computer are able to perform classification tasks on sound, images,
and even text. Deep learning models can provide a high level of accuracy,
and when set up properly they can sometimes outperform humans at these
tasks. These models are trained with large sets of labeled data and with
neural network architectures that contain several layers.


Most methods available for deep learning, as we will discuss in more detail
as we go through this guidebook, rely on neural networks. This is why most
of the models found in deep learning are called deep neural networks.


The word “deep” in this term refers to the number of hidden layers in the
neural network. A traditional neural network usually has only two or three
hidden layers, but the type of models that we will be working on in this
guidebook may have 100 or more hidden layers.


Training these models takes more time and effort, mainly because they have
so many hidden layers whose weights need to be learned. The training process
involves the use of large sets of data. Neural network architectures then
extract patterns directly from the data, with no need for manual
intervention.


Let’s take a look at an example of how these deep neural networks work. A
good example is the convolutional neural network, or CNN. The CNN uses 2D
convolutional layers, which makes it well suited to processing 2D data such
as images.


When you work with a CNN, there is no need to extract features manually,
because you are not required to identify the features that are used for
classifying the images. CNNs extract the features that they need directly
from the images. The ability of deep learning models to extract features
automatically makes them suitable for several problems, including object
classification.


CNNs rely on numerous layers to figure out the features that are present in
an image. The complexity of the image features increases with every layer.
For example, the first hidden layer may detect the edges in the image, and
it is only at the very end of the network that the complex features that
tell us what is in the image are detected.
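
The idea that an early convolutional layer responds to edges can be
illustrated with a few lines of NumPy. The sketch below slides one fixed 3x3
filter over a tiny made-up image, the way a convolutional layer slides its
filters. In a real CNN the filter values are learned during training; the
function conv2d_valid, the image, and the filter are assumptions made for
this illustration.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A 6x6 image that is dark (0) on the left half and bright (1) on the right
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# A filter that responds to dark-to-bright changes from left to right
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

response = conv2d_valid(image, kernel)
print(response[0])    # [0. 3. 3. 0.]
```

The response is zero where the image is uniform and large where the window
covers the boundary between the dark and bright halves, which is how a filter
marks a vertical edge.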


A good example of using deep learning models in the real world is fraud
detection systems. Once the system has learned the patterns of normal
activity, any anomaly that comes through can be detected and classified as
potential fraud that needs to be checked out.


It is important to see that machine learning and deep learning are
connected, but they are still a bit different. Deep learning is a subfield
of machine learning, which means that it is a type of machine learning, but
it takes things a bit further than some of the other algorithms of this type
are able to. Deep learning concerns itself with algorithms known as
artificial neural networks, which are inspired by the structure and function
of the brain.


We are going to spend some time talking about these neural networks, what
they are all about, and how to work on one of your own as we progress
through this guidebook, because they are so important to the ideas behind
deep learning. For now, suffice it to say that neural networks are key to
getting the results that you want with your own code.


If you are new to programming and have no experience with neural networks or
deep learning, this may be confusing at first. In many cases, deep learning
can be thought of as working with large neural networks. Andrew Ng, who
co-founded Google Brain and later served as Chief Scientist at Baidu
Research, has explained this well.


In some of his earlier talks on deep learning, Andrew described it in the
context of traditional artificial neural networks. The idea of deep
learning, according to Andrew, is to use brain simulations in the hope of
making learning algorithms much better and easier for programmers to use,
and of making big advances in the fields of AI (artificial intelligence) and
machine learning.


Because technology is changing so fast, and computers are so much faster
than they used to be, more deep learning algorithms and processes are usable
now than ever before. There is also enough data available now to properly
train some of the larger neural networks that are a part of deep learning.


Just as with the other parts of machine learning that you may have worked
with in the past, there is a lot that you can do with deep learning. Some of
the current applications of deep learning algorithms include:


1. Speech recognition that is automatic
2. Image recognition
3. Visual art processing to figure out what is in the image, if it is a
   real copy, and more
4. Language processing
5. Toxicology, as well as other types of drug discovery
6. Customer relationship management
7. Systems for recommendation
8. Bioinformatics
9. Military
10. Financial fraud detection to keep consumers safe and to save banks and
    other financial institutions a lot of money
11. Restoring images
12. Advertising on mobile devices
13. Medical image analysis


These, of course, are just a few examples of what you can do with deep
learning. As more programmers discover this type of learning and start to
use it in their own code, and as more technology is developed over time, it
is likely that more and more applications will be added to that list!


Looking More at the Artificial Neural Networks 


We will talk about these networks more in a later chapter, but first we need
to take a look at what artificial neural networks are. These networks are
the core of deep learning. A neural network is a kind of network in which
the nodes are called artificial neurons. The concept of neural networks
dates back to the 1940s, and interest in them was renewed in the 1980s. The
nervous system inside of humans is made up of interconnected neurons that
maintain a high level of coordination to receive, and then transmit,
messages to the spinal cord and to the brain. In machine learning, we mimic
what happens inside of the brain, and this kind of network is known as an
ANN, or Artificial Neural Network.


Artificial Neural Networks are made up of neurons that have been created
artificially. These are then programmed and trained so that they can
approximate some of the cognitive skills that we see in humans. Applications
of ANNs include anomaly detection, time-series prediction, voice
recognition, soft sensors, and image recognition, to name a few.


A neural network is represented as a mathematical model and is mostly used
in machine learning. It is made up of neurons that are connected with one
another so that they can send signals to each other. A neuron accumulates
the signals it receives until they reach a threshold, and at that point it
forwards a signal to the next connected neuron in the network.
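To make the threshold idea a little more concrete, here is a minimal sketch
of a single artificial neuron in plain Python. The function name, the
weights, and the threshold value are all made up for illustration; real
networks usually use smoother activation functions than this hard cutoff.

```python
def threshold_neuron(inputs, weights, threshold):
    """Return 1 (fire) if the weighted sum of the inputs reaches the threshold."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total >= threshold else 0

# Hypothetical example values: two incoming signals with equal weights.
weights = [0.6, 0.6]
quiet = threshold_neuron([1, 0], weights, threshold=1.0)   # 0.6 is below 1.0
firing = threshold_neuron([1, 1], weights, threshold=1.0)  # 1.2 reaches 1.0
print(quiet, firing)
```

With only one signal arriving, the neuron stays silent; once both signals
arrive, the weighted sum passes the threshold and the neuron forwards a
signal.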


Think of your years in school when you played the game of telephone (without
all of the funny misinterpretations of the words). Someone would start the
game by whispering a word or a short phrase into the ear of another person.
When the message was finished, that person would relay what they had heard
to the person on the other side of them. This pattern keeps going down the
line until the last person (or neuron, in this kind of network) hears the
message and says it out loud.


The people in the game do not pass on the message until they hear all of it.
They don’t relay just one or two words of a phrase, because they know that
would be the wrong information and the game would not finish well. Instead,
they let the person before them complete the whole message, and then they
relay it as best they can. This is similar to how neurons share messages in
a neural network.


Of course, there are times when the communication does not go so well. Part
of the fun of the telephone game is that some parts are not heard well, so
the last person gets the answer wrong and comes up with something silly in
the end. With a neural network, this is not something that we want to
happen. We want the neurons to pass on the message as effectively and
accurately as possible. If the deep learning algorithms are used in the
proper manner, this is what will happen.


The connections between the neurons can be made in any manner that you would
like, even from a neuron back to itself, but arbitrary connections cause
problems when it is time to train the network. This is why restrictions are
imposed on how these neural networks are built in the first place.


In a multi-layer perceptron, the neurons are arranged in layers, and each
neuron is allowed to pass signals only to the neurons in the next layer. You
won’t see signals skipping layers or jumping backward. The first layer is
the input layer, and the last one is the output layer, which produces the
predicted value.
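The layered arrangement described above can be sketched in a few lines of
plain Python. This is only an illustration under assumed values: the
weights and biases are invented, the helper names are hypothetical, and the
sigmoid function is just one common choice of activation.

```python
import math

def sigmoid(z):
    """Squash any number into the range between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights, biases):
    """Compute one layer: each row of weights feeds one neuron in the layer."""
    return [sigmoid(sum(x * w for x, w in zip(inputs, row)) + b)
            for row, b in zip(weights, biases)]

# Hypothetical weights for a network with 2 inputs, 2 hidden neurons, 1 output.
hidden_w, hidden_b = [[0.5, -0.4], [0.3, 0.8]], [0.0, 0.1]
output_w, output_b = [[1.2, -0.7]], [0.05]

inputs = [1.0, 2.0]                                      # input layer
hidden = layer_forward(inputs, hidden_w, hidden_b)       # hidden layer
prediction = layer_forward(hidden, output_w, output_b)   # output layer
print(prediction)
```

Notice that the signals only ever move forward: the inputs feed the hidden
layer, and the hidden layer feeds the output layer.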


In an artificial neural network, learning refers to the process of modifying
the weights and biases of the network. Learning is carried out by training
the network: a certain set of inputs is fed to the network, and a particular
output, called the target output, is expected in the end.


Training calls for adjusting the values of the weights and the biases so
that the network gives you the target output. This process can be compared
to the way most humans learn. The training is done over and over again until
we arrive at weights that give us the target output. After each pass over
the training data, normally referred to as an epoch, the weights are
adjusted so that the error of the neural network is reduced. This continues
until the error is small enough that the program performs the way that it
should.
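Here is a small sketch of that training loop for a single neuron with one
weight and one bias. It uses plain gradient descent on a mean squared
error, which is one common way to adjust weights but not the only one. The
toy data, the learning rate, and the number of epochs are all assumptions
chosen for illustration.

```python
# Toy data for the hypothetical target rule y = 2*x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

weight, bias = 0.0, 0.0
learning_rate = 0.05  # assumed step size

def mean_squared_error(w, b):
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

initial_error = mean_squared_error(weight, bias)

for epoch in range(2000):               # one epoch = one pass over the data
    grad_w = grad_b = 0.0
    for x, target in data:
        error = (weight * x + bias) - target
        grad_w += 2 * error * x / len(data)
        grad_b += 2 * error / len(data)
    weight -= learning_rate * grad_w    # nudge the weight to reduce the error
    bias -= learning_rate * grad_b      # nudge the bias to reduce the error

final_error = mean_squared_error(weight, bias)
print(weight, bias, final_error)
```

After enough epochs, the weight and bias settle close to the values behind
the toy data, and the error ends up far smaller than where it started.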


Why Use Python to Help with This Kind of Learning?


Now that we have a better idea of what deep learning is all about (don’t
worry, we’ll take more time to learn about this as we go), and we have
taken a look at artificial neural networks, it is time to look at the Python
language. When it comes to machine learning, especially with deep learning
as the focus, you will find that a lot of people want to work with Python.


There are a lot of reasons to like the Python language. Sure, you can do
deep learning with the help of other coding languages, and if there is
another one that you already know or prefer, then go ahead and work with
that. But for the most part, especially for those who have never done much
with coding or machine learning, Python is one of the best languages to use
for this.


Python is an open-source coding language that is available throughout the
world, and it is free to use. You can pay for some extra libraries and
features if you want to get them from a third party, but the basic parts of
the programming language, and everything you need to get started with your
own code, are free.


Python code is also easy to learn. Compared to a lot of the other coding
languages out there, Python will not take a lot of time to pick up. The
language was designed with the beginner in mind, so it makes sense that it
is simple, clean, and built on English words that are easy to understand.
If you have stayed away from the neat things that you can do with machine
learning and deep learning because you thought the coding would be too
difficult, then Python is the language that will set your mind at ease.


Even though this is a simple coding language to work with, and many
beginners use it as their springboard into coding, there is still a lot of
power behind it. Think of all the power that you will need for some of the
topics we will explore with deep learning. And yet, Python is the main
coding language that programmers turn to when they are ready to do deep
learning. This should tell you something about the amount of power that you
get with this language.


Another thing to enjoy here is the standard library that comes with Python.
Just by downloading the software from the www.python.org website, you get a
lot of capabilities to use in your Python code, and these are just
considered the basics. If you have never worked with Python in the past,
take some time to look through some of the basic libraries that come with
Python to see what they can provide to you.
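As a quick taste of what comes built in, the short sketch below uses three
modules from the standard library, with nothing extra to install. The
sample scores and the date are made-up values for illustration.

```python
import math
import statistics
from datetime import date

scores = [72, 85, 91, 68]                  # hypothetical sample data

average = statistics.mean(scores)          # average of the list
root = math.sqrt(16)                       # square root
stamp = date(2020, 1, 31).isoformat()      # a calendar date as text

print(average, root, stamp)
```

These three are only a small sample; the standard library also covers
things like file handling, random numbers, and working with text.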


When you want to do some deep learning, though, there are some capabilities
that the Python language alone is not going to provide. This doesn’t mean
that you will be lost and unable to do the work that you want. It just means
that we need to add some other libraries to the language. There are several
libraries that work well with Python and help you to do deep learning in
this language. We will explore some of these in this guidebook so that you
know exactly how to make them work for your needs.


Since this language is so popular and used throughout the world, it makes
sense that it has a large community. Even when you are working with deep
learning, you will be able to find other people who are doing the same thing
and who have the same questions as you. This means that you can get help
anytime that your code isn’t working or you get stuck.


While there are other coding languages that you can work with, Python is
generally the go-to option for creating programs that need to do deep
learning. In fact, Python works well with pretty much any machine learning
algorithm that you want to use in your program. You will see as we go
through this guidebook that the Python language is simple to learn, has all
of the power that you need even for deep learning, and gets the work done
without the hassle of complicated coding along the way.


What Do I Need to Know About the Python Code?


We are going to take a look at a lot of different code examples that you can
use for deep learning in this guidebook. This is one of the best ways to
learn how to work with deep learning, and it will make sure that you are
able to do some work in each of the libraries that we discuss. With this in
mind, though, we need to have some idea of how Python code works so that we
are able to use it in the proper manner.


First on the list is the idea of keywords. These may not seem like a big
deal, but every coding language, Python included, relies on keywords to tell
the interpreter what you would like it to do. They are simple, but you need
to make sure that any word reserved as a keyword in this language is used
only for that purpose, and never as a name in another part of the code. If
you use a keyword in the wrong place, the interpreter will treat it as a
command, will not be able to make sense of the line, and will raise an
error.
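If you are ever unsure whether a word is reserved, Python can tell you. The
sketch below uses the standard keyword module; the word "total" is just an
arbitrary example of a name that is not reserved.

```python
import keyword

reserved = keyword.iskeyword("for")         # "for" is a reserved word
not_reserved = keyword.iskeyword("total")   # "total" is free to use as a name

# Using a keyword as a variable name is rejected before the code even runs.
try:
    compile("for = 5", "<example>", "exec")
    keyword_as_name_allowed = True
except SyntaxError:
    keyword_as_name_allowed = False

print(reserved, not_reserved, keyword_as_name_allowed)
```

The full list of reserved words is available as keyword.kwlist if you would
like to look through it.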


Statements are another important part of the code. A statement is a single
instruction that Python can carry out, and statements come in a lot of
different shapes and sizes. For example, the line of code that makes a
message show up on the screen is a statement. Later, we are going to take a
look at some ways to test whether the deep learning libraries have been
installed properly on your computer, and the line that displays the
confirmation message on the screen is a statement as well.
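As a rough sketch of what such a check could look like, the lines below ask
Python whether a library can be found and then print a message either way.
The package name "tensorflow" is an assumption here; swap in whichever
library you have installed. Each line that does something, including each
print, is a statement.

```python
import importlib.util

library_name = "tensorflow"  # assumed package name; change it as needed
is_installed = importlib.util.find_spec(library_name) is not None

if is_installed:
    print(library_name, "is installed")
else:
    print(library_name, "is not installed yet")
```

This only looks the library up without loading it, so it runs quickly even
when the library is large.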


Comments are another part of the code that we need to spend a bit of time
on. They are not essential, because you could write a full program without
any of them. But they help to clean up the code, can explain what you are
doing in it, and make it easier for you and other programmers to know what
is going on. You can add as many comments to the code as you would like, and
the interpreter will simply skip over them without interrupting the code at
all. The most common practice is to add comments only where you need them so
that your code is still easy to read.


As you go through the code examples in this guidebook, you will notice a few
of these comments. You can add any comment that you want by placing the #
sign before it. Everything from the # to the end of that line is the
comment, and a longer comment simply starts each of its lines with #.
Writing something like “#this is a comment” tells the interpreter that you
are just leaving a little note, and that it should skip over that part of
the code and move on to the next command.
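Here is a tiny example of comments in action. The variable name is made up
for illustration.

```python
# This whole line is a comment, so Python skips it.
hours_per_day = 24  # A comment can also follow code on the same line.

# The next line is "commented out", so it never runs:
# hours_per_day = 25

print(hours_per_day)
```

Only the two lines that contain real code are carried out; everything after
a # sign is ignored.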


The next thing that we need to look at is how to name your identifiers. We
will come across a few of these identifiers later, especially when it comes
to the TensorFlow library, and there are a couple of different kinds that
you can work with. You have to name them in the proper manner so that you
can call them up when needed and so that they work the proper way in your
code.


Working through Python code will show you that there are many types of
identifiers. Identifiers are the names given to entities such as classes,
functions, and variables. While these are different kinds of things, they
all follow the same naming rules, no matter what kind of identifier you are
working with. Once we list the rules below (they really are not as hard as
they seem), you will be able to use them to name any identifier you want.


This brings us to our main point: how to give these identifiers the right
name. The first thing that you have to check is whether you are accidentally
using a keyword as the name. Never do this, because it will cause an error
and keep the program from running. Most other names are fine. You can use
any upper-case and lower-case letters that you would like, as well as
numbers and the underscore symbol, to build the name that you want. You
should also pick a name that is easy to remember and makes sense for the
part of the code you are working on.


There are a few restrictions to keep in mind when you start naming your
identifiers. First, an identifier is not allowed to begin with a number, and
the name must not contain any spaces. Naming an identifier something like
5kids or 5 kids would get you an error, but naming it fivekids or five_kids
would be just fine. And keep in mind that you should never use a keyword as
the name of one of your identifiers, or you will get an error.
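The naming rules above can be checked directly in Python. The helper
function below is a hypothetical name invented for this sketch; it combines
the built-in str.isidentifier() method with the standard keyword module.

```python
import keyword

def is_valid_name(name):
    """True if name is a legal identifier and not a reserved keyword."""
    return name.isidentifier() and not keyword.iskeyword(name)

candidates = ["five_kids", "fivekids", "5kids", "five kids", "class"]
results = {name: is_valid_name(name) for name in candidates}

for name, ok in results.items():
    print(repr(name), "is fine" if ok else "is not allowed")
```

The first two names pass. The third starts with a number, the fourth
contains a space, and the last one is a keyword, so those three are
rejected.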


When you come up with the name that you want to give to an identifier, make
sure that you remember what it is. It may follow all of the rules, but if
you cannot remember the name when it is time to use that identifier later
on, there can be issues. Python names are case-sensitive, so if you call it
the wrong thing or spell it differently, you will get an error or end up
referring to something else.


The final thing that you will notice as you go through the code in this
guidebook is the operators. These are simple, but they do a surprising
amount of work in your code. Operators come in a lot of different forms, and
the ones that you use will depend on what you would like the code to do.


There are many types of operators to work with. Assignment operators assign
a value to a variable. Boolean operators combine conditions to determine
whether a statement as a whole is true or false. Arithmetic operators add,
subtract, multiply, and divide values. And comparison operators let you
check whether two values are equal, different, or larger or smaller than one
another.
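A short example shows all four kinds of operators working together. The
variable names and numbers are invented for illustration.

```python
price = 20                  # assignment: = stores a value in a variable
quantity = 3

total = price * quantity    # arithmetic: *, +, -, /
average = total / quantity

is_expensive = total > 50   # comparison: >, <, ==, !=
is_bulk = quantity >= 10

needs_review = is_expensive and not is_bulk   # Boolean: and, or, not
print(total, average, is_expensive, needs_review)
```

The comparison operators produce True or False values, and the Boolean
operators then combine those values into a single answer.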


As we go through the code that works with our three main deep learning
libraries, you will find that operators, as well as these other basic parts,
show up on a regular basis. Often you will start to use them without even
realizing it. But it is still important to know the basics of your coding so
you can really put it to work for you.


These are just a few of the basic parts that show up in Python code. While
this is only a quick introduction to what you can do in Python, it gives us
some background to use as we move on. As we go through the TensorFlow,
Keras, and PyTorch libraries, you will start seeing a lot of these basic
components, and more, show up in your code, and learning a bit about them
now will make that code seem more familiar and easier to use.


Chapter 2: Exploring the Best Python Libraries for Deep Learning


As we go through this guidebook, we are going to use three primary libraries
to build and work with our own deep learning models: TensorFlow, Keras, and
PyTorch. Let’s take a look at what each one can do for us, and how it will
help with the deep learning models that we want to create, before we move on
to installing them and actually working with them!


TensorFlow 


The first Python library that we are going to focus on is TensorFlow. This
is a framework and library from Google that can be used to create our own
deep learning models. TensorFlow relies on data flow graphs to carry out
numerical computations. It is one of the first libraries that most
programmers turn to when they want to do some coding in machine learning
because it makes the process easier. You can use this library for tasks such
as acquiring the data you need, training machine learning models, predicting
future results, and making modifications easier.


This library was first developed by the Brain team at Google for machine
learning at a large scale. It bundles together algorithms for both machine
learning and deep learning and makes them more useful via a common
metaphor.


TensorFlow uses Python to give users a front-end API for building their
applications. The applications themselves, though, are executed in
high-performance C++.


TensorFlow is used to build, train, and run deep neural networks for many
different tasks, which makes it a good fit for the models that show up in
deep learning. You may choose this library for natural language processing,
word embeddings, recurrent neural networks, handwritten digit
classification, and image recognition.


TensorFlow also helps its users to create data flow graphs, which are
structures that describe how data moves through a graph, or a series of
processing nodes. Each node in the graph represents a mathematical
operation, and every edge that connects two nodes represents a tensor, which
is a multidimensional array of data.
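To get a feel for the idea of a data flow graph, here is a toy version
written in plain Python. To be clear, this is not TensorFlow and does not
use its API; the Node and Constant classes are hypothetical names made up
for this sketch, and plain numbers stand in for tensors.

```python
import operator

class Node:
    """A graph node: an operation plus the nodes whose outputs feed into it."""
    def __init__(self, operation, *inputs):
        self.operation = operation
        self.inputs = inputs

    def evaluate(self):
        # Values flow along the edges from the input nodes into this node.
        values = [node.evaluate() for node in self.inputs]
        return self.operation(*values)

class Constant(Node):
    """A source node that simply emits a fixed value."""
    def __init__(self, value):
        super().__init__(lambda: value)

# Build the graph for (a * b) + c. Nothing is computed yet.
a, b, c = Constant(2.0), Constant(3.0), Constant(4.0)
product = Node(operator.mul, a, b)
result = Node(operator.add, product, c)

answer = result.evaluate()   # now the data flows through the graph
print(answer)
```

The graph is described first and only evaluated at the end, which is the
basic pattern behind a data flow graph: nodes hold the operations, and the
connections between them carry the data.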


All of this is made available to the user through the Python programming
language. Python is easy to learn, and it gives you a convenient way to
express how high-level abstractions are put together. A good thing to
remember when you are working with TensorFlow is that the nodes and tensors
that we talked about above are objects in Python, and TensorFlow
applications are Python applications as well.


However, the math operations themselves are not carried out in Python,
because heavy numerical work is slow in pure Python. Instead, the
TensorFlow library uses transformations that are compiled as C++ binaries.
The work of Python in this library is to direct the traffic between the
pieces and to provide the high-level abstractions that connect them
together.


Even though two different coding languages are used to get the work done,
applications built with this library can run on almost any convenient
platform. This means that you can run them on local machines, cloud
clusters, iOS and Android devices, CPUs, and GPUs. The models that are
created can be used on many different devices, based on what works for you,
to make predictions from the data.


One of the biggest advantages of the TensorFlow library is that it offers
abstraction. Instead of dealing with the low-level details of implementing
algorithms, or working out the right way to channel the output of one
function into the input of another, the developer can focus on the overall
logic of the application under development. The library takes care of the
rest of the work for you.


In addition to all of the above, TensorFlow provides developers with some
extra help when it is time to debug an application. Its eager execution mode
lets you evaluate each operation of the graph separately and transparently,
rather than having to build the whole graph as a single object and then
analyze it all at once.


Keras 


Now that we have had a chance to look at the TensorFlow library, it is time
to look at a second library that will help you get the work done: Keras.
Keras is another library for deep learning that is written in Python.


The Keras library is best used for developing neural networks, and it runs
on top of other libraries, including CNTK, TensorFlow, and Theano. It was
designed to make quick experimentation possible.


This library was designed for human beings, rather than for machines, which
means that compared to a lot of the other libraries out there, it is more
user-friendly overall. Keras gives high priority to the experience of the
user. The user is only expected to take a few steps to get common tasks
done, rather than having to write out long, complicated code with a lot of
different parts to it.


Keras is a good library to use because it can be extended easily. You can
add new modules to it, and the existing modules come with plenty of examples
that you can use for your own code and deep learning models. If you find
that the capabilities of Keras are not as complete as you would like, it is
simple to extend it to work for your needs.


The ability to create new modules for added expressiveness has made Keras
one of the best Python libraries for advanced research. Work in machine
learning, and especially in deep learning, often calls for an extensive
amount of research on a given topic, and the Keras library will help you get
this done.


PyTorch 


We have now had a chance to look at two of the Python libraries that you can
use for the deep learning models in your code. TensorFlow is one of the
first libraries that people choose to add to Python when they want to do any
task that involves machine learning. It comes with a lot of capabilities and
is definitely worth your time to learn. From there, we moved on to Keras,
which is a good library to work with when you want to develop the neural
networks that go with your deep learning algorithms.


Now, it is time to take a look at our third library. This one is known as
PyTorch, and it brings something new to help us get the best out of the deep
learning that we work on in this book.


PyTorch is again a library based on the Python language, and it was
developed specifically to offer more flexibility when you work on new models
of deep learning. Its workflow resembles that of NumPy, a scientific
computing library for Python. If you have used the NumPy library in the
past, PyTorch will feel familiar.


The PyTorch library was first released in 2016, and many programmers have
since started using it to build neural networks. It is also a popular choice
because it is easy to use and does not take as long to learn as TensorFlow
or some of the other libraries that are based on the Python language.


PyTorch relies on the eager, or imperative, paradigm. Each line of code that
is needed to build a graph defines a component of that graph, and
computations are performed on these components right away, even before the
whole graph has been built. This methodology is known as “define by run,”
and it can be useful for defining your graph and getting it built up in the
proper manner for your model.
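The “define by run” idea can be illustrated with a very small sketch in
plain Python. This is not PyTorch and not how PyTorch is implemented; the
Value class is a hypothetical name invented here. Each line of arithmetic
both computes a result right away and records how that result was made, so
the graph exists as soon as the code has run.

```python
class Value:
    """A number that remembers how it was computed (a node in the graph)."""
    def __init__(self, data, parents=(), grad_fns=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents      # the nodes this value was built from
        self.grad_fns = grad_fns    # how to pass a gradient to each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other),
                     (lambda g: g, lambda g: g))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (lambda g: g * other.data, lambda g: g * self.data))

    def backward(self, grad=1.0):
        # Walk the recorded graph backward, accumulating gradients.
        self.grad += grad
        for parent, fn in zip(self.parents, self.grad_fns):
            parent.backward(fn(grad))

x, w, b = Value(2.0), Value(3.0), Value(1.0)
y = x * w + b      # y.data is available immediately, as the line runs
y.backward()       # the graph was recorded along the way
print(y.data, x.grad, w.grad, b.grad)
```

Because the graph is recorded while ordinary Python code runs, you can use
regular if statements and loops to decide what the graph looks like for
each input.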


PyTorch is best known for its dynamic computation graphs. It provides a
framework that we can use to build computation graphs as the code runs,
rather than relying on predefined graphs with specific functionality already
built in.


The reason that we want to work with dynamic computation graphs, rather than
some of the other options, is that they can be changed, even at runtime.
This feature is useful when we are uncertain about how much memory we will
need to create a neural network. The changes can be made as we need them, no
matter how much space the neural network needs or how long it takes you to
find out.


These three libraries will be important as we go through the deep learning
models in this guidebook. Each of them works in a different manner and lets
you build different types of models within the deep learning world. This
makes it much easier to get things done, to teach the computer how to
behave, and to create the programs that you would like.


As we go through this guidebook, you will see more about how these libraries
work and when you would use each one.


Chapter 3: Installing and Setting Up the Libraries That You Need





Now that we know a bit more about the three libraries that we are going to
use in this guidebook, it is time to look at the steps that we need to take
to get them set up and ready to go. We are going to go through each library
on its own, as each of them is a separate package that needs to be set up
one at a time. Let’s get started and make sure that the three libraries we
talked about above are set up and ready to use on your computer!


How to Install TensorFlow 


The first library that we are going to install on your computer is
TensorFlow. This library offers APIs for a variety of programming
languages, including Rust, Java, Go, Haskell, and C++. There is also a
third-party package for R, which is known as tensorflow, that you can use as
you would like.


We are going to go through the process that you need to use in order to 
install this program on Windows, but the process that you use for 
downloading this library on a few other operating systems, like Mac and 
Linux, will be pretty much the same. When you are on a Windows 
computer, you will be able to install TensorFlow with either pip or 


Anaconda. 


The native pip method lets you install this library on your system without
needing to go through and use a virtual environment. If you do this,
though, the installation can sometimes interfere with some of the other
Python packages that are already on your system. The good thing to remember
here is that you only need to run a single command, and then the library
will be up and running on your computer. In addition, when you choose to
install TensorFlow with pip, you will be able to run your programs from any
directory that you choose to use for programming.


To install this library with the help of Anaconda, you will need to go 
through and create a virtual environment in most cases. Inside of the 
Anaconda program itself, it is often recommended that you install this 
library with the help of the pip command rather than doing the command 
for the conda install. Both of them are going to work well, but the pip 


method is often going to be the easier one for you to work with. 


Before we get started, ensure that Python 3.5 (or any Python 3 version that
is newer than that one) is installed on the Windows system. Python 3 comes
with the pip3 program, which can be used for the installation of the
TensorFlow library. What this means is that we want to use the “pip3
install” command for installation purposes. The command that you will need
to install the CPU-only version of TensorFlow is:


pip3 install --upgrade tensorflow


The command that we have above should be run from your main command 
line. If you want to install this in a different way and go with the GPU 
version for this library, the command that you will need to use will be the 


following: 


pip3 install --upgrade tensorflow-gpu


Either of the two commands above is going to get the TensorFlow library set
up on your Windows system. You will need to give it a few minutes so that
everything has time to download and install properly.


Once the installation has finished, it is a good idea to verify whether it
was really successful or not. The best way to make this happen is to open
the Python interpreter from the command prompt and then go through the
following command sequence to check:


>>> import tensorflow as tf
>>> hello = tf.constant('Hello, this is TensorFlow!')
>>> ses = tf.Session()  # tf.Session belongs to the TensorFlow 1.x API
>>> print(ses.run(hello))


If you put this code into the system in the right manner, when you run it the
screen should show the message “Hello, this is TensorFlow!”. When you see
that message printed out on your screen, it means that the TensorFlow
library is installed on your system the right way and you are able to use
it in any manner that you would like.


How to Install Keras 


The next library that we need to take a look at is Keras, and how to get this 
all set up on your system. You will find that the steps for this installation 
will be even easier than what we did with TensorFlow, simply because the 
only decision that has to be made is which preferred backend engine you 
want to use. Once you choose this engine, you will be able to go through 
and install the Keras library just like you would any other library on 
Python. 


You will find here that Keras is going to run on some other libraries, which 
include Theano, CNTK, and TensorFlow. What this means is that you have 
to make sure you pick out a good backend engine to work with—otherwise, 


this library is not going to work. 


Keep in mind that the Keras library is not really meant to perform
operations that are considered low level. Instead, it gives you an
advantage because you are able to create models at a higher level, one
layer at a time. The low-level operations that you are able to do will
really depend on the backend, which is another reason why the backend is so
important. The libraries that you are able to pick out as the backend of
Keras include the following:


1. TensorFlow: This is going to be an open-source framework that 
is going to be for symbolic manipulation of tensors. It was 
originally developed by Google. 

2. Theano: This is another open-source framework that is going to 
do the same thing with manipulating those tensors, but this one 
was developed by LISA Lab. 

3. CNTK: And the final library that you can choose as the backend 
is going to be CNTK. This is a toolkit that is open-sourced and 


was originally developed by Microsoft for deep learning. 


What this means is that we need to have one of the above to help you as the 
backend before installing the Keras library. It often depends on your own 
preference and what you are hoping to achieve with this process, but for the 
most part, the default is going to be TensorFlow, and that is what most 
programmers like to use. And since we already went through the steps on 
how to install this library on your system, that is going to save some time 


and make things a bit easier. 


Once you have TensorFlow, or one of the other chosen backends, in place,
you are able to install the Keras library. The command that you will need
to use to do this is the following:


pip3 install keras


This should be all that you need to help you install the Keras library. The 
default setting that you are going to have here is that the TensorFlow library 
is going to be your backend, which is what we just discussed. However, 
remember that you are able to choose another type of backend if you want. 


This can be done if you work with the set command. 


Let’s say that you want to move away from the TensorFlow backend because
you don’t want to use it or you feel one of the other options would be
better. We are going to choose Theano as the new backend. You will just
need to use a simple command to switch the backend over to the new one. The
command that you need to make sure this happens is:


set "KERAS_BACKEND=theano"
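
The same choice can also be made from inside a Python script instead of the
command prompt. The sketch below is only an illustration. It assumes a
multi-backend Keras release, which reads the KERAS_BACKEND environment
variable when the library is first imported, and it writes the backend name
in lowercase.

```python
import os

# Assumption: multi-backend Keras looks up KERAS_BACKEND at import time,
# so the variable has to be set before the import statement below runs.
os.environ["KERAS_BACKEND"] = "theano"

try:
    import keras  # succeeds only if Keras and the chosen backend are installed
    print("Keras imported with backend:", keras.backend.backend())
except Exception as exc:
    # Reached when Keras or the requested backend is not available.
    print("Keras could not be imported:", exc)
```

Setting the variable in the script only affects that one Python process, so
it is a convenient way to try a backend without changing the whole system.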


You can then take a moment to check whether the Keras library was
installed successfully or not. This just takes a simple import command,
“import keras”, typed into the Python interpreter.


How to Install PyTorch 


And the final library that we need to take a look at to make sure that it is 
installed on the computer is the PyTorch library. You can install this library 
on a number of operating systems including several of the Linux 


distributions, Mac and Windows. 


First, we are going to take a look at how to install this library on your
Windows system. If you want CUDA support with this library, your Windows
system needs to have an NVIDIA GPU. PyTorch can be installed on Windows 7
and higher, although Windows 10 or above is recommended. You are also able
to install it on Windows Server 2008 R2 or above. Also keep in mind that
when you use this library on Windows, you need to work with one of the
versions of Python 3, not any of the versions of Python 2.


For this book, we are going to use Python 3.5, and we are going to install
this library with the help of pip. You can run the following commands from
your terminal in the Windows operating system to make this all work:


pip3 install http://download.pytorch.org/whl/cpu/torch-0.4.1-cp35-cp35m-win_amd64.whl








pip3 install torchvision 


The commands above are the ones to use any time that your system doesn’t
have CUDA support. It is also possible to install PyTorch through Anaconda
if you want to work with a non-CUDA Windows system. With the Anaconda
program in place, a sandboxed kind of environment is going to be created
for you. You will just have to go through and use the following commands to
make this work:


conda install pytorch-cpu -c pytorch 


pip3 install torchvision 


The two commands should be enough to get the PyTorch library all set up 
for you. You should now verify whether the installation ended up being 
successful for you or not. When you are working with the prompt for 
Anaconda, you can just type in python in order to get to your Python 
terminal. You then need to spend some time typing in the following 


statements from the opened Python terminal to get started: 


from __future__ import print_function
import torch 
y = torch.rand(5,3) 


print(y) 


If this prints out what you are asking for, then you will know that the code 
is running in a successful manner. This means that the PyTorch library is all 


ready to go, and you will be able to use it for your needs. 


At this point, all three of the libraries are going to be in place and are ready 
to use. They should work well on your computer, and they are ready to do 
the work that you would like. You can spend some time looking at all three 
of these libraries to see how they work, what features they have available, 


and what all you will be able to do with the systems! 
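
As a quick way to confirm that all three installations can be found by
Python, a small check like the one below may help. It is only a sketch that
uses the standard library; the helper name is_installed is our own, and the
module names tensorflow, keras, and torch are the import names assumed for
the three libraries.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Report on each of the three libraries without actually importing them.
for name in ("tensorflow", "keras", "torch"):
    status = "installed" if is_installed(name) else "missing"
    print(name, "->", status)
```

A library that shows up as missing here simply needs its install command
run again in the same Python environment.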


Chapter 4: The Mathematics of the Neural 


Networks 


In this chapter, we are going to take a bit more time to look at neural 
networks, as well as how they work together to help you do some more of 
the tasks that you need in deep learning. Specifically, we are going to look 
at some of the mathematics that comes with these neural networks and what 


you are able to do with them. 


Before we get too much into the mathematics that comes with these neural 
networks, we need to stop and look a bit more at what these neural 
networks are all about. We need to see how they are going to work, why 


they are so important in the codes that we are going to work with, and more. 


As we go through these neural networks, we are going to see that they are a
type of machine learning model that fits in great with deep learning. They
can be trained with labeled examples, which is known as supervised
learning, or they can be left to find patterns on their own, which is known
as unsupervised learning. In the unsupervised case, the program learns by
itself rather than being shown a lot of labeled examples. This can be
useful in a lot of the programs that you want to build with deep learning
and can make it a lot easier to write the kinds of codes that you want to
write out.


When we look at the neural networks, each of the layers that you go
through will get the algorithm to stop and see if there is some kind of
pattern in the image that it is looking through. Once a layer has found
what it can, the network takes the necessary steps to move on to the next
layer. This process continues through one layer after another, until all of
the layers have been processed. The program, if it is doing its job in the
right manner, is going to be able to give you a good prediction back about
what is inside the image that you scanned.


There are a few things that could possibly happen when you enter this 
point, based on how the program is working. If the algorithm was able to go 
through the process above, and it did a good job at sorting through the 
layers, it will then provide you with a prediction. If the program is right in 
its prediction, the neurons of this system, just like the neurons in the brain, 


are going to become stronger. 


The reason that this works so well is that the program decided to work with 
artificial intelligence, which allowed it to make strong associations between 
the object and the patterns that it found. The more times that the system is 
able to look at a picture and come back with the right answer, the more 


efficient it is going to be the next time that you use it. 


To find out how this works will require us to look a bit closer at how the 
neural networks work together. For example, let’s say that you have set out 
to create a new program that is able to hold onto a picture, which will be 
your input. And then it can look through the various layers of that picture 


until it figures out that the image inside is a car. 


Now, if you have had a chance to actually write this program and it has
been coded in the proper manner, then it can make an accurate prediction
that the picture in front of it is a car. The program will then present the
prediction to you, based on the features that it already knows about the
car, its past experiences, and more. This neural network will be able to
look at a lot of factors to help it determine that the picture in front of
it is a car, but it will need to go through a lot of different layers to
make this happen.


While we may now have a better understanding of how this all works, it is 
time to see some of the mechanics that come with it. To see the neural 
networks actually work, you need to make sure that the program has, at one 
point or another, seen an image of a car that it can then compare to the 
newer image that you present. The neural network is going to take this 


learning picture and look it all over and learn from it. 


After you present what is going to be your learning picture, the neural 
network is going to start on the top layer. This may be the edges on the car 
to start with. Then, it will continue to go through every layer that is found in 
the picture, learning along the way so that the neural network can compare 
and contrast later when you have other pictures. The whole time, the neural 


network is storing this information to use later. 


Depending on the type of picture that you are presenting back to your 
neural network, it is possible that the network is going to have to go through 
a lot of layers before making the prediction. The nice thing here though is 
that the more layers and the more details that you are able to give to the 
algorithm, the better off it is going to be and the more accuracy the neural 


network will present with its predictions over time. 


After the neural network is done with going through the various layers that
come with this, it is going to remember. It will not have to start all over
with the predictions that it makes once it is done. Instead, it will be able to
learn as it goes, figuring out what it finds, and comparing. Over time, if
this algorithm is set up the proper way, you can present it with any kind
of picture that you want, and it will know what item is inside.


Any time that a programmer wants to work with the neural network 
algorithm, you will often be working with things like face recognition 
software and other similar projects. When this happens, all of the 
information that you and the program need won’t be available ahead of 
time. But you are able to use this method in order to teach the system the 
best way to recognize the right faces in the image. You can also use this one 
to help with different types of animals, to define models of cars, and so 


much more. 


As you can imagine reading through this chapter, there are a lot of different
advantages that come with this particular model when working on machine
learning. One of the advantages that you are going to notice is that you can
utilize these methods without having to work out the statistics of the
problem by hand. Even if those statistics are not available, the neural
network will still be able to finish the work for you. The reason that this
ends up working so well is that the network can handle a nonlinear
relationship between the dependent and the independent variables.


There are a few times when you will not want to work with this method.
One of the main reasons that programmers would choose not to go with a
neural network is that it is another one of those models that has a high
computing cost. For some of the smaller businesses who are interested in
working with machine learning, the cost in time, computational power, and
money is just going to be too much, and they will need to look at some
other algorithms instead.


The Mathematics of the Neural Networks 


The first thing that we need to know when it comes to the math of a neural
network is that it is pretty simple to work with, and you are able to
solve most of it with a calculator, pen, and paper, at the very most. Even
though it is possible to do it this way, you have to remember that a
network can have thousands of these neurons, so solving everything by hand
would take forever, even when each step is simple.


A second problem that can come up is that a lot of the calculations that you 
can do with the neural networks will need matrices. If you are not familiar 
with or comfortable with using these matrices in math and in codes, you 


will find that it can make the mathematics more difficult to work with. 


Since we are not going to do all of the math for the neural networks with 
our calculator or with pen and paper, it is time to take a look at some of the 
basics that come with these networks. This starts out with “weights.” A 
“weight” is going to be a connection between your neurons that can carry a 
value. The higher the value, the larger the weight, and the more importance 
that we are able to attach to a neuron on the input side of our weight. In 
addition, you will find that in programming and math, you need to view 


these kinds of weights in the matrix format to make it easier to compute. 


As we are able to see with the image above, the input layer is going to work
with three neurons, and then the hidden layer, which is going to be right
below it, is going to have four neurons to work with. With this information
in place, we are able to create a matrix of three rows and four columns,
and then insert the value of each weight into that matrix, as has been done
in the image above.


In order to help us out with this matrix, we are going to call it W1 so that
we can refer to it later. In the case where we have some more layers
present, we would want to go with more of these weight matrices, such as
W2, W3 and so on. In general, if a layer L has N neurons, and the next
layer L + 1 has M neurons, the weight matrix is going to be an N by M
matrix. This means that there are going to be N rows as well as M columns.


Take another look at the image above, and you will see that the largest
number in the matrix shows up at W22, which carries the value of 9. The W22
weight connects IN2 at the input layer with N2 at the hidden layer. What
this means is that “at this state,” or right now, the N2 neuron is going to
think that out of all the three inputs, IN2 is the most important one when
it is trying to make its decisions.
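
To make the matrix idea more concrete, here is a small sketch in plain
Python. Only the shape (three rows by four columns) and the value 9 at W22
come from the discussion above. Every other number is made up for
illustration, and the names W1 and incoming_to_n2 are our own.

```python
# Hypothetical weight matrix W1: 3 input neurons (rows) by 4 hidden
# neurons (columns). Only the 9 in row 2, column 2 is taken from the text.
W1 = [
    [2, 4, 1, 3],  # weights leaving IN1
    [5, 9, 6, 7],  # weights leaving IN2
    [1, 3, 8, 2],  # weights leaving IN3
]

rows, cols = len(W1), len(W1[0])
print("W1 is", rows, "by", cols)

# Column 2 (index 1) holds every weight that feeds the hidden neuron N2.
incoming_to_n2 = [row[1] for row in W1]

# The input with the largest incoming weight matters most to N2 right now.
most_important_input = incoming_to_n2.index(max(incoming_to_n2)) + 1
print("N2 currently weighs IN%d most heavily" % most_important_input)
```

Reading down a column shows everything that flows into one hidden neuron,
while reading across a row shows everything that flows out of one input.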


The second thing that we need to take a look at here when it comes to the 
mathematics of the neural networks is the bias. This bias is another part of 
the code that has some weight. Imagine that you are working on a situation, 
or otherwise trying to use the program to make some decisions. You will 
need to stop and think about all of your possible (or even observable) 


factors. 


While it is sometimes easy to see the observable factors, we have to think
about some of the parameters that could influence the result, even though
we haven’t thought about them yet. What about those factors that you just
haven’t been able to consider yet? When we are building a Neural Net, we
try to not only factor in the things that we are able to observe, we also
try to cater to the non-observable and unforeseen factors. This is going to
be known as the bias in deep learning.


Every neuron that is not yet in the input layer is going to have a bias that 
can be attached back to it. And when we look at the bias, we will see that it 
works similar to the weight in that it is going to carry a value connected to 
it. Let’s take a look at a picture that helps us to understand how this is going 


to play out in the neural network: 


As we look at this, we are going to notice that there are two layers that we
need to focus on, the hidden layer and the output layer, or the ones that
have four and two neurons respectively. When you take a look at this, you
will start to see that each one of these neurons has a tiny blue or red
arrow that points right to it. You may also see that these arrows do not
have a neuron as their source.


Similar to what we were doing with the weights before and how we turned
them into a matrix, the bias is also something that we are able to see as a
matrix that has one column, or what we call a vector in deep learning
language. Using the picture above to help us out, you would express the
bias that is present for the hidden layer as [[0.13], [0.14], [0.15], [0.16]].


Now that we know the weights and the bias, it is time to take a look at the
activation that comes with these neural networks. After a neuron has
aggregated all of the input that comes into it, we are going to call that
aggregation “z.” We will take a closer look at the aggregation in a bit, but
just go with us on this one. Once it has the aggregation z, a neuron is
supposed to make a small decision with that value so that it can then
return an output.


This kind of process, or the function, is going to be known as the activation. 
We are going to be able to represent this function as f(z), where z is going 
to be the aggregation of all the input. To help us see how this works a bit 
more, we should first know that there are two broad categories that come 


with activation, and these are going to be non-linear and linear. 


If we see that f(z) = z, we can say that f(z) is a linear activation, which
means that the value passes through unchanged. The rest are going to be
seen as non-linear and will be used in a slightly different manner. Let’s
take a look at an example of this and then explore what is going on inside
of it.


Now, we have a few options to work with here because there is more than
one way that we can get the neuron to make the right decision. This means
that there are a few choices of what f(z) is able to be. Some of the most
popular options and the ones that you are most likely to use in your coding
will include:


1. Rectified linear units or ReLU: We will see some of this in the codes
that we write later on in this guidebook, so it is a good idea to gain
some familiarity with it now. With the ReLU, we are going to make
sure that any output we have is not going to be allowed to go to a
negative number, or below zero. This means that if z is higher than
zero, then our output is going to remain at z. If it is negative, then the
output will have to be zero. The formula that you would need to use
with this one is f(z) = max(0, z). In other words, the output is
whichever is larger, 0 or z.

2. Tanh: Here, the formula that we are going to use is f(z) = tanh(z) . It is 
really that simple because we are just finding the hyperbolic tangent 
that goes with z, and then we will return this. A scientific calculator 
will be able to get this done for you. 

3. Sigmoid activation: This is another thing that we are going to start
seeing a lot in some of the syntaxes that we use throughout this
guidebook, and you will start to become really familiar with it as we
go along. For this one, we are going to use the formula of
f(z) = 1/(1 + e^(-1*z)). The steps that we do to get this one done are
as follows:


a. We want to negate the z. This is done by multiplying by -1.

b. Raise e to the power of the output from step a. Again, a calculator
can do this for you to speed up the process.

c. Add 1 to the output that you got in the step above.

d. Divide 1 by the output that you got in the step above.

e. If you are able to do all of these steps, then you have
figured out the sigmoid activation.
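
The three activation functions in the list above can be sketched in a few
lines of plain Python. This is only an illustration built on the standard
math module. The function names are our own, and the sigmoid follows steps
a to d one at a time so that each step is visible.

```python
import math


def relu(z):
    # Output is whichever is larger, 0 or z.
    return max(0.0, z)


def tanh(z):
    # Hyperbolic tangent of z, straight from the math module.
    return math.tanh(z)


def sigmoid(z):
    negated = -1 * z              # step a: negate z
    exponent = math.exp(negated)  # step b: raise e to that power
    denominator = 1 + exponent    # step c: add 1
    return 1 / denominator        # step d: divide 1 by the result


for z in (-2.0, 0.0, 3.0):
    print(z, relu(z), tanh(z), sigmoid(z))
```

For very large negative values of z, math.exp can overflow, so a library
implementation would normally guard against that. The sketch leaves it out
to stay close to the steps in the text.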


Keep in mind with this one that there are going to be a lot more of the non- 
linear activation functions compared to the linear ones. Because of this, it is 
most likely that you will work with the non-linear functions in your codes. 
Also remember that the choice that you make in the functions will be highly 
dependent on the problem you would like to solve, as well as what the NN 


is trying to learn in the process. 


The Mathematics of It All 


At this point, we have gone over some of the basics of what we need to 

know when working with the mathematics of a neural network. With this in 
mind, it is time to do some of the math. We went over all of the parts to help 
us know what is going on, but now we need to actually take the information 
that we find in the neural network, and do some of the math based on the set 
of data that we have. Let’s have a look at the following image to help us get 


started on this in the right way. 


Yes, we did see this same image when we talked about activation functions
earlier. Now, it is time to take a look at the aggregation that we talked
about before, and then we will move on to some of the math as well. See all
of the things that are in parentheses in the image above? Call that the z
function that we are working on. Then, keep the following things in mind:


1. b = this is going to be the bias.

2. x = this is going to be the input that we would like to send to the
neuron.

3. w = this is going to be the weights that we talked about at the
beginning of the chapter.

4. n = this is going to be how many inputs are going to be sent up from
the incoming layer.

5. i = this is going to be a counter that steps through each of the n
inputs.


Before we work on this any further, note that in the beginning the only
neurons that are going to have values attached to them are the input
neurons that are on the input layer. (These are going to be the values that
we are able to observe from the data that we are using to train the
network.) This brings up the question of how we are able to make this work
for the other neurons:


1. First, you want to make sure that you multiply every incoming neuron 
by the corresponding weight that goes with it. 

2. Then, stop here and add up all of the values that you are working with 
as well. 

3. Add in the bias term that belongs to the neuron in question to finish it
off.
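
The three steps above can be sketched for a single neuron as follows. The
numbers are invented purely for illustration, and the function name
aggregate is our own choice.

```python
def aggregate(inputs, weights, bias):
    """Compute z for one neuron from its inputs, weights, and bias."""
    if len(inputs) != len(weights):
        raise ValueError("each input needs exactly one weight")
    # Step 1: multiply every incoming value by its corresponding weight.
    products = [x * w for x, w in zip(inputs, weights)]
    # Step 2: add up all of those products.
    total = sum(products)
    # Step 3: add the bias term of this neuron.
    return total + bias


# Hypothetical values for a neuron with three incoming connections.
inputs = [1.0, 2.0, 3.0]
weights = [0.5, -1.0, 2.0]
bias = 0.5

z = aggregate(inputs, weights, bias)
print(z)  # 0.5 - 2.0 + 6.0 + 0.5 = 5.0
```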


And this is all that you have to do in order to evaluate the z for the
neuron. This seems pretty simple, but remember that we have only been
discussing how to do it with one neuron. Imagine that you were going to do
this for thousands of neurons in many different layers. If you have
thousands of neurons and hundreds of layers, think about how long this
whole process would take to accomplish. It would take you a very long time
to work through all of this by hand, even if you wanted to.


The good news with this one is that there are a few tricks that you are able
to use to make this work for you. Remember the vectors and the matrices
that we spent some time on earlier? Here’s when it is time for us to use
them. The steps that you need to do to help out when you have a ton of
neurons and layers to focus on include:


1. You want to start this process out with a weight matrix from the 
input layer and have it go to the output layer as we described 
earlier. 

2. Create a matrix that is M by 1 from the biases that you got 
earlier. 

3. You can then view the input layer in the form of an N by 1 
matrix. You can also do this in the vector of size N if this is 
easier to work with the bias. 

4. Transpose the weight matrix. When you do this, you are going to 
end up with a matrix that is M by N. 

5. From here, we want to find the dot product of the transposed
weights and the input. According to the dot product rules, if you
take the dot product of a matrix that is M by N and a matrix that
is N by 1, then you will end up with an M by 1 matrix.

6. Add the output that you get in the fifth step to your bias
matrix. If you went through and did this in the right manner, then
you will find that these two are going to have the same size.

7. And finally, you are going to have all of the values for your
neurons. This should give you a matrix that is M by 1, or a
vector of size M.
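
The seven steps above can be sketched for one layer in plain Python, using
lists in place of a matrix library. The weights, inputs, and biases are
made-up values, and the helper names transpose and layer_aggregation are
our own.

```python
def transpose(matrix):
    """Turn an N by M list of rows into an M by N list of rows."""
    return [list(column) for column in zip(*matrix)]


def layer_aggregation(weights, inputs, biases):
    """Return the M values of z for a layer.

    weights is N by M, inputs has N entries, biases has M entries.
    """
    weights_t = transpose(weights)  # step 4: now M by N
    # Step 5: dot product of each transposed row with the input vector.
    dots = [sum(w * x for w, x in zip(row, inputs)) for row in weights_t]
    # Step 6: add the bias vector, which has the same size M.
    return [d + b for d, b in zip(dots, biases)]


# Hypothetical layer: N = 3 inputs feeding M = 4 neurons.
W1 = [
    [1, 0, 2, 1],
    [0, 1, 1, 2],
    [1, 1, 0, 0],
]
inputs = [1, 2, 3]
biases = [0.5, 0.5, 0.5, 0.5]

z = layer_aggregation(W1, inputs, biases)
print(z)  # step 7: a vector of size M
```

In practice a numerical library would do the transpose and dot product in
one call, but writing them out by hand shows where each size comes from.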


After you have been able to go through and do all of these steps, you can 
then go through and run the activation function, choosing the one that you 
like the most, on every value that shows up in the vector. You have to go 
through and do this for all of the weight matrices that you have, finding the 
values of the neurons and units as you go forward. You will continue with 
this until you are able to get to the end of the network, which is going to be 


the output layer. 
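
Repeating the calculation layer by layer, with an activation applied after
each one, might look like the sketch below. It assumes a fully connected
network, picks ReLU only as an example activation, and uses invented
weights and biases. The function name forward is our own.

```python
def relu(z):
    return max(0.0, z)


def forward(inputs, layers, activation):
    """Run the inputs through every layer and return the output values.

    layers is a list of (weights, biases) pairs, where weights is N by M.
    """
    values = inputs
    for weights, biases in layers:
        # Reading column j of the weights is the same as reading row j
        # of the transposed matrix, so no explicit transpose is needed.
        z = [
            sum(row[j] * x for row, x in zip(weights, values)) + biases[j]
            for j in range(len(biases))
        ]
        # Apply the activation to every value before the next layer.
        values = [activation(v) for v in z]
    return values


# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
layers = [
    ([[1, -1], [0, 1]], [0.0, 0.5]),
    ([[1], [2]], [-1.0]),
]

output = forward([2.0, 1.0], layers, relu)
print(output)
```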


The last thing that we are going to need to do here is to calculate how
far the output of the network is from the expected output, so that we can
attempt to correct the errors if there are any. Just remember that the
methodology that we focused on here really only works with networks that
are fully connected. The weight matrices that come into play for other
kinds of networks will be different. Now that we have gone through these
steps, you will be able to build up your own Neural Network and calculate
the output based on some given input. It is an easy process, but it can
take some time to do if you have a lot of neurons in a lot of different
layers. Go through the steps that we have talked about in this chapter to
make sure that you can create and do the math with the neural networks when
we do some deep learning.


Chapter 5: The Basics of Using the 


TensorFlow Libraries 


Now that we have had some time to look over TensorFlow a bit and have it
downloaded onto our system of choice, it is time to look at some of the
basics that come with this library. There are a lot of different parts that
come with this library—and while we talked about a summary of it before, 
we need to go a bit more in-depth to make sure that we really understand 
how this works, as well as what we are able to do. Before we dive into 
some of the codes that we need to do with deep learning, we are going to 
first look at some of the different things that come with TensorFlow, to help 
us get prepared. Some of the basic parts that you need to know more about 


when it comes to the TensorFlow library include the following: 


DataFlow Graphs 


The first thing that we are going to take a look at is the dataflow graphs. 
When you are working in TensorFlow, the computation is going to be based 
all on the graphs. These graphs are so important because they are going to 
be there as a way to solve many mathematical problems in your system. 


Let’s take a look at the expression that is below:


x = (y + z) * (z + 4)


It is also possible to take the expression that is above and show it in
another way, including the following:


p = y + z
q = z + 4
x = p * q


When the expression is written in the second way, it is much easier to
express it in graph form. In the first version, we had a single expression
to work with, but once we divide it up, we end up with two intermediate
expressions, p and q, that do not depend on each other, so both of them can
be computed in parallel. We can gain from this in terms of computation
time. Such gains are important in deep learning applications, especially
when we are talking about Recurrent Neural Networks (RNN) and Convolutional
Neural Networks (CNN). These two neural network architectures are more
complicated, which is why we need to make sure that we are working with
graphs in the proper manner.
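To make the split concrete, here is a minimal sketch in plain Python (not
TensorFlow itself). The function name evaluate_graph and the sample inputs
are assumptions made for illustration. The point is only that p and q each
depend on the inputs alone, so neither has to wait for the other.

```python
def evaluate_graph(y, z):
    """Evaluate x = (y + z) * (z + 4) as three separate graph nodes."""
    p = y + z   # node p: needs only the inputs y and z
    q = z + 4   # node q: needs only the input z, so it is independent of p
    x = p * q   # node x: can run only after both p and q are available
    return p, q, x


p, q, x = evaluate_graph(2.0, 1.0)
print(p, q, x)
```

A graph library can use exactly this dependency information to schedule p
and q at the same time and then combine them in x.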


The goal of the TensorFlow library is to implement such graphs and to
compute their operations in parallel wherever possible, which leads to
gains in efficiency. In this library, the nodes of the graph represent
operations, and the data that flows along the edges between them is known
as tensors, which are basically just multidimensional data arrays.

A typical neural network graph starts at the input layer, where we should
expect to find the input tensor. After the input layer, we get to the
hidden layer, which uses a rectified linear unit (ReLU) as its activation
function.


Constants 


Next on the list to focus on is constants. In TensorFlow, we create
constants using the function constant. In the TensorFlow 1.x API that this
chapter uses, this function has the signature that is given here:


constant(value, dtype=None, shape=None, name='Const', verify_shape=False)


Let’s take a look at this signature now. The value is the actual constant
value to be used for computation down the line, and dtype is the data type
parameter, such as int8, int16, float32, or float64. Then, we move on to
shape, which is optional and allows you to specify the dimensions if
needed. Then, there is name. This is also optional, and it is a name that
you can give to the tensor. The last parameter, verify_shape, is a Boolean
that indicates whether the shape of the values should be verified.


Now, it is possible that you will need constants that hold a specific
value. If this is something that has to be done in your own training
model, you will want to create a constant object to make this happen. You
can take the signature from above and fill in the values and names that
work best for your code.


Variables 


The TensorFlow library also makes heavy use of variables. Variables are
in-memory buffers that hold tensors. They need to be initialized
explicitly, and they are used in the graph to make sure that state is
maintained across the whole session. When you call the Variable
constructor, the variable is added to the computational graph as well.


Variables are used, for the most part at least, when you work with a
training model, where they hold and update parameters. The initial value
that you pass as the argument to the constructor is either a tensor or an
object that can be converted into a tensor. What all of this means is that
we need to fill any variable we want to work with with either a random or
a predefined value, which can then be used later in the training process
and even updated through the iterations. A good way for us to define a
variable is:


m = tf.Variable(tf.zeros([1]), name="m")


Sessions 


The next topic that we need to take a look at is sessions. In order for us
to evaluate the nodes, the computational graph that we are using has to be
run within a session. Remember here that the purpose of the session is to
encapsulate the state and the control of the TensorFlow runtime.


If you create a new session without passing any parameters, it will use
the default graph. If you do pass a graph as a parameter, the Session
class will accept it and use that graph when you execute the session.


To get a better idea of what is going on, let’s take a moment to look at
a “hello” program and see how a TensorFlow session works in this library:


import tensorflow as tf

h = tf.constant('Hello TensorFlow!')
s = tf.Session()
print(s.run(h))


The output that you get when you run this will be Hello TensorFlow! (in
Python 3 it is printed as a bytes string, b'Hello TensorFlow!'). This may
seem pretty simple, but it is a good way to get some practice when you are
working in Python, and it shows you a bit of what you are able to do with
code in the TensorFlow library.


Placeholders 


As you are working on code that you would like to write in TensorFlow,
you may find that there are times when you do not know the value of an
array y during the initial declaration phase of the TensorFlow program.
This means that we are not aware of this value until we reach the
with tf.Session() as ses stage. When this happens, TensorFlow expects us
to declare the basic structure of our data by use of the tf.placeholder
declaration. This ensures that the graph still has something in place for
y, and lets us build the rest of the program without the code raising an
error because there wasn’t anything there to start with.


We are able to do this for y by using the code below:


# create a TensorFlow placeholder
y = tf.placeholder(tf.float32, [None, 1], name='y')


Since we are not providing an initial value in the declaration, we need
to tell TensorFlow the data type of the elements that will be in the
tensor. Here, we use tf.float32.


The second argument denotes the shape of the data that we would like to
inject into the placeholder. In this case, we want an array of size
(? x 1). Since we do not know in advance how many rows of data we will
supply to the array, we use the “?” to stand for that unknown dimension.


In the placeholder declaration, this unknown size is written as None. With
that in place, we can inject any number of rows of 1-dimensional data into
y later on. To do so, our program needs one change in the ses.run(x, ...)
call: the data for y is passed in through the feed_dict argument.
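As a quick check of what “(? x 1)” means in practice, the sketch below
uses NumPy only (no TensorFlow) to build the kind of array that could be
fed to a placeholder of shape [None, 1]. The variable names flat and
column are illustrative assumptions.

```python
import numpy as np

# Ten values 0..9 as a flat, 1-dimensional array of shape (10,)
flat = np.arange(0, 10)

# np.newaxis adds a second axis, giving a column of shape (10, 1),
# which is compatible with a placeholder declared with shape [None, 1]
column = flat[:, np.newaxis]

print(flat.shape, column.shape)
```

Any number of rows would fit the same declaration, because only the second
dimension is fixed at 1.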


Now that you are done with writing out all of this other code, you have
implemented your graph in this library. The complete code looks like the
following:


import tensorflow as tf
import numpy as np

# begin by creating a TensorFlow constant
const = tf.constant(2.0, name='const')

# create the TensorFlow variables
y = tf.Variable(2.0, name='y')
z = tf.Variable(1.0, name='z')

# create the operations
p = tf.add(y, z, name='p')
q = tf.add(z, const, name='q')
x = tf.multiply(p, q, name='x')

# create the variable initialization operation
init_op = tf.global_variables_initializer()

# launch the session
with tf.Session() as ses:
    # initialize the variables
    ses.run(init_op)
    # calculate the graph output
    x_out = ses.run(x)
    print("Variable x has a value of {}".format(x_out))

# Placeholder variant: to supply y at run time, declare it as a
# placeholder instead of as a Variable above:
#     y = tf.placeholder(tf.float32, [None, 1], name='y')
# and pass the data in through feed_dict when running x in the session:
#     x_out = ses.run(x, feed_dict={y: np.arange(0, 10)[:, np.newaxis]})


If you have not done so yet, take some time to type this code into your
editor, run it, and see how it works for you. Working through it yourself
will make it much easier to build your own graphs in this library, and
being comfortable with graphs makes a real difference in how well you are
able to do with TensorFlow and with deep learning.


When you run the program as written, with y = 2.0, z = 1.0, and the
constant 2.0, you get the output “Variable x has a value of 9.0”, since
(2.0 + 1.0) * (1.0 + 2.0) = 9.0.


Chapter 6: Deep Learning with TensorFlow


Now that we have gotten a chance to look a bit more at TensorFlow and
some of the different things that you are able to do with this library, it
is time to take a look at what you are able to do with deep learning and
this library. This is the meat and potatoes of the guidebook, the good
stuff that we have been working toward all along. So, let’s break it down
and see some of the different things that we will be able to do when we
are working with this library.


As we go through this chapter, we are going to demonstrate all of the
steps that are needed to train a neural network model with the TensorFlow
library. TensorFlow offers ready-made estimators such as DNNClassifier for
this kind of task, while the code in this chapter builds its model with
the Keras Sequential API.


Our goal with this kind of neural network is to train it with the help of
the MNIST dataset. This dataset is often treated as the “hello world” of
deep learning projects. You will be able to find this “hello world”
dataset inside the TensorFlow package, so it should already be set up for
you. The dataset is made up of 28 x 28 grayscale images of handwritten
digits. It is a larger dataset to work with, since it contains 55,000
training rows, 5,000 validation rows, and 10,000 testing rows.


Importing the Data That You Need 


With that in mind, we need to move on to the first step here. We need to
make sure that we have everything in place to help us write the code in
this chapter. First, it is time to import the libraries that we need to
use, including the following:


import os

import numpy as np
import tensorflow as tf
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import Flatten
from keras.layers.convolutional import Conv2D
from keras.layers.convolutional import MaxPooling2D
from keras.optimizers import Adam
from keras.utils import np_utils
from PIL import Image


There are a lot of different things that we need to have on our computer
before we are able to work with the program that we will write next. This
may seem like a lot, but it is necessary to get all of the parts ready
and working together. As you may have noticed, we are going to import a
few things from the Keras library as well, so if you have not already
installed that library on your computer, it is critical that you do so
before we get going on this journey.


We are going to use the following code to load the set of data that we
are going to use here. The data can be imported through the Keras library
that we talked about before. You will see the progress of the data
download the first time you do this. The code that we need to make this
happen is:


(X_train, y_train), (X_test, y_test) = mnist.load_data()


Next, it is time for us to change the data into the shape that we need.
Since we are working with a convolutional neural network, we need to make
sure that our data is reshaped into four dimensions: batch, height, width,
and channels.
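The reshaping step described here is not shown in the listing, so the
following is only a sketch of what it might look like. It uses a small
zero-filled stand-in array instead of the real MNIST data, and the float32
conversion is an assumption added so that a later division by 255 works on
floating point values.

```python
import numpy as np

# Stand-in for X_train as returned by mnist.load_data():
# 5 grayscale images of 28 x 28 pixels, stored as unsigned bytes
X_demo = np.zeros((5, 28, 28), dtype=np.uint8)

# Add a trailing channel axis (1 channel for grayscale) and switch to floats
X_demo = X_demo.reshape(X_demo.shape[0], 28, 28, 1).astype("float32")

print(X_demo.shape, X_demo.dtype)
```

With the real data, the same two calls would be applied to both the
training images and the test images.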


It is also possible for us to add some images of our own to this dataset.
We can do it both in the training data and in the test data. The code
that we are able to use in order to get this done is:


def load_images(image_label, image_directory, features_data, label_data):
    files_list = os.listdir(image_directory)
    for file in files_list:
        image_file_name = os.path.join(image_directory, file)
        # only process .png files
        if ".png" in image_file_name:
            # open the image and convert it to grayscale
            img = Image.open(image_file_name).convert("L")
            img = np.resize(img, (28, 28, 1))
            im2arr = np.array(img)
            im2arr = im2arr.reshape(1, 28, 28, 1)
            # append the image and its label to the existing data
            features_data = np.append(features_data, im2arr, axis=0)
            label_data = np.append(label_data, [image_label], axis=0)
    return features_data, label_data


The code that we just went through helps us load the features and all of
the labels that we need. Keep in mind here that we have only defined our
function so far. We named it load_images(), and it takes four parameters.
It lists all of the files that are available in the image directory. The
function then checks the format of each file and only processes the .png
images. If your images are .jpg files instead, change .png to .jpg in
that check.


Each image, at this point, is loaded, converted to grayscale, and turned
into an array with the same shape as one entry of the features data, and
this image array is appended to features_data. The function also takes the
image label and appends it to label_data.


Once all of the images in the directory have been processed, the function
returns the updated features data and label data.


Now, it is time to move on to the next step. We need to pass in the
directory that holds the images to make sure that they are properly
appended to the existing set of data. This means that we simply load the
images into both the training and the test sets. To make this happen, we
use the code below:


X_train, y_train = load_images('1', 'F:/mnist', X_train, y_train)
X_test, y_test = load_images('1', 'F:/mnist', X_test, y_test)


From here, we need to take a moment to normalize the data. The inputs are
rescaled from the range 0-255 to the range 0-1:


X_train = X_train.astype('float32') / 255
X_test = X_test.astype('float32') / 255


We have the labels here, but they are still plain digits and have not
been converted into categories. It is now time for us to one-hot encode
all of them by using the code that we have below:


total_classes = 10
y_train = np_utils.to_categorical(y_train, total_classes)
y_test = np_utils.to_categorical(y_test, total_classes)
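To see what this categorical encoding produces, here is a small
plain-Python sketch of the same idea. The function one_hot is a
hypothetical stand-in written for illustration, not the Keras
implementation.

```python
def one_hot(labels, total_classes):
    """Turn integer class labels into one-hot rows (lists of 0.0 and 1.0)."""
    encoded = []
    for label in labels:
        row = [0.0] * total_classes
        row[int(label)] = 1.0   # mark the position of the true class
        encoded.append(row)
    return encoded


print(one_hot([3, 0], 5))
```

Each label becomes a row with a single 1.0 in the column of its class,
which is the form that a softmax output layer is compared against.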


It is now time for us to do the work that is needed to create the model.
The following code is a good one that you are able to use in order to make
this happen:


model = Sequential()
model.add(Conv2D(32, (5, 5),
                 input_shape=(X_train.shape[1], X_train.shape[2], 1),
                 activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(22, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.5))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(total_classes, activation='softmax'))
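It can help to trace how the image size shrinks through these layers. The
sketch below does the arithmetic in plain Python, assuming the Keras
defaults of 'valid' padding and stride 1 for the convolutions and
non-overlapping 2 x 2 pooling. The helper names conv_output and
pool_output are hypothetical.

```python
def conv_output(size, kernel):
    """Side length after a 'valid' convolution with stride 1."""
    return size - kernel + 1


def pool_output(size, pool):
    """Side length after non-overlapping pooling (floor division)."""
    return size // pool


size = 28                      # MNIST images are 28 x 28
size = conv_output(size, 5)    # first Conv2D, 5 x 5 kernel   -> 24
size = pool_output(size, 2)    # first MaxPooling2D, 2 x 2    -> 12
size = conv_output(size, 3)    # second Conv2D, 3 x 3 kernel  -> 10
size = pool_output(size, 2)    # second MaxPooling2D, 2 x 2   -> 5
flattened = size * size * 22   # Flatten: 5 * 5 positions * 22 filters

print(size, flattened)
```

Under these assumptions, the first Dense layer receives a vector of 550
values for every image.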


Keep in mind with this one that some dropout layers have been added to
the model. They are there to help keep the model from overfitting. With
this in mind, we are able to compile the model with the following code:


model.compile(loss='categorical_crossentropy', optimizer=Adam(),
              metrics=['accuracy'])


When we get to this point, we should have a model that compiles without
any errors. This is a good thing. If some kind of error does show up
when you compile the model, then it is time to go back through and
double-check your work, because something is missing or mistyped.


Now that the model is defined and you know it compiles in the proper
manner, it is time to train the model. This is a simple addition that
takes just one more line of code:


model.fit(X_train, y_train, validation_data=(X_test, y_test),
          epochs=7, batch_size=200)


Take note here that we are only working with 7 epochs. You can choose to
use as many of these as you would like inside of your code, but for the
most part, this is going to be enough. With this model, you are unlikely
to see much improvement in accuracy when you go past the 7th epoch, so
the extra training time is usually not worth it.


And there you have it. Type in the code above and see what happens when
you ask the interpreter to execute it. A lot of information will be
printed as the model trains, and you can take some time to read through
it. This is an important step when you are learning how to use deep
learning algorithms, and when you are done, you can celebrate creating
your very first model in TensorFlow!


Chapter 7: Some of the Basics of the Keras Library


Now that we have had some time to look at the TensorFlow library, as well
as what you are able to do with it, it is time to take a look at the Keras
library. There is certainly a lot that the TensorFlow library provides to
us, but even in the model that we just built, some parts of the Keras
library were pulled in to help complete the process. There is a lot that
the Keras library is able to help us out with, and we are going to take
some time here to explore the variety that is available with this
library, as well as what you are able to do with it. So, let’s take a look
at some of the basics that you need to know in order to get started with
the Keras library.


The Learning Rate 


Training a deep learning model or a neural network can be a difficult
optimization task. The standard algorithm used for it is known as
stochastic gradient descent. It has been found that, on some problems,
one is able to achieve better performance and faster training by using a
learning rate that does not stay static but instead changes over time.


Using a well-chosen learning rate for the stochastic gradient descent
optimization procedure helps to reduce the training time and to improve
performance overall. The purpose of learning rate schedules is to adjust
the learning rate during the training of your neural networks. This
brings us to our first exercise on how to use them in the Keras library.


The Keras library comes with a time-based learning rate schedule already
built in, which saves you some time and hassle. The stochastic gradient
descent optimization algorithm is implemented in the SGD class, which has
an argument called decay. This argument is used in the time-based
learning rate decay schedule equation, as shown below:


LearningRate = LearningRate * 1/(1 + decay * epoch)


When the decay argument is 0, which is its default value, it has no
effect on the learning rate. Let’s look at how this works out with a
learning rate of 0.1:


LearningRate = 0.1 * 1/(1 + 0.0 * 1)
LearningRate = 0.1

If you do specify the decay argument, then the learning rate will
decrease from the previous epoch according to the equation above. A
default schedule that works pretty well is to set the value of decay to
the initial learning rate divided by the number of epochs, as shown below
for a learning rate of 0.1 and 100 epochs:


Decay = LearningRate / Epochs
Decay = 0.1 / 100
Decay = 0.001
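The schedule can be traced by hand with a few lines of plain Python. This
sketch applies the time-based decay equation once per epoch to the previous
learning rate. That reading of the formula, and the function name
time_based_schedule, are assumptions for illustration, and the optimizer
inside Keras may apply its decay per weight update rather than per epoch.

```python
def time_based_schedule(initial_lr, decay, epochs):
    """Return the learning rate used in each epoch under time-based decay."""
    rates = []
    lr = initial_lr
    for epoch in range(1, epochs + 1):
        # LearningRate = LearningRate * 1 / (1 + decay * epoch)
        lr = lr * 1.0 / (1.0 + decay * epoch)
        rates.append(lr)
    return rates


for epoch, lr in enumerate(time_based_schedule(0.1, 0.001, 5), start=1):
    print(epoch, round(lr, 6))
```

With decay set to 0.0, the same function returns 0.1 for every epoch, which
matches the default behavior described earlier.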


Let’s now put all of this into an example that demonstrates how we can
use the time-based learning rate schedule in Keras. This will make it
easier to see how the learning rate behaves, and it will help you get your
own learning rate schedule set up and ready to go. The example that we
are going to use is known as the Ionosphere binary classification
problem.


First, we need to get the data set ready to go. You will be able to
download the data set from
https://www.dropbox.com/s/3cewwavkalec3913/ionosphere.csv?dl=0. Download
this set of data and save it in your working directory under the file
name ionosphere.csv. We will then create a small neural network model
that has just one hidden layer with 34 neurons in it, using the rectifier
(ReLU) activation function. The output layer contains only one neuron,
and it uses the sigmoid activation function to give us probability-like
values when we are all done.


In this example, we use a fairly high learning rate for stochastic
gradient descent, namely 0.1. The decay argument is set to 0.002
(0.1/50), and we train the model for 50 epochs.


Another thing that we need to focus on in this code is the momentum. It
is often a good idea to use momentum together with an adaptive learning
rate, and we set it to 0.8 for this example. Make sure that the data file
you saved earlier is in the same directory as this code. The code that
you will need to write to make this work is:


import numpy
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import Dense
from tensorflow.python.keras.optimizers import SGD

# fix the random seed for reproducibility
seed = 7
numpy.random.seed(seed)

# load the dataset
df = read_csv("ionosphere.csv", header=None)
dataset = df.values

# split into input (X) and output (Y) variables
X = dataset[:, 0:34].astype(float)
Y = dataset[:, 34]

# encode the class labels as integers
encoder = LabelEncoder()
encoder.fit(Y)
Y = encoder.transform(Y)

# create the model
mod = Sequential()
mod.add(Dense(34, input_dim=34, kernel_initializer='normal', activation='relu'))
mod.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))

# compile the model with time-based learning rate decay
epochs = 50
learning_rate = 0.1
decay_rate = learning_rate / epochs
momentum = 0.8
sgd = SGD(lr=learning_rate, momentum=momentum, decay=decay_rate, nesterov=False)
mod.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])

# fit the model
mod.fit(X, Y, validation_split=0.33, epochs=epochs, batch_size=28, verbose=2)


This may look like a lot of code, but it sets up the learning rate
schedule that we need and lets us see the results that we want here. Make
sure that the libraries imported at the beginning of the code are
installed, then run the code with your Python interpreter and see what
happens!


In this example, the training runs for 50 epochs. We use 67 percent of
the data set to train the model, and the other 33 percent is set aside to
validate the model that we are making. The point of doing this is to
double-check the work that we are doing and to see if the model is
working the way that we want.


With this code, the reported accuracy on the validation data reaches
99.14 percent. The baseline accuracy to compare against is 95.69 percent,
which means that we are doing really well with this schedule, and we know
that when we decide to use the model, the results that we get are going
to be pretty accurate.


Drop-Based Learning Rate Schedule 


With this kind of schedule, the learning rate is dropped systematically
at certain times during training. The method is often implemented so
that the learning rate is dropped by half after each fixed number of
epochs that we get to choose.


To make more sense of this, let’s say that we start with a learning rate
of 0.1 and multiply it by 0.5 after every 10 epochs. The first ten epochs
of training use a learning rate of 0.1. Then, the next ten epochs use a
learning rate of 0.05, and it continues on like this.


This is implemented in Keras by using the LearningRateScheduler callback
when fitting the model. When you use this callback, you define a function
that takes the epoch number as an argument and returns the learning rate
to be used by stochastic gradient descent. When the callback is in use,
the learning rate that was specified in the stochastic gradient descent
optimizer itself is ignored completely.


We are going to use the previous Ionosphere dataset, and then we are
going to create our own network that has just one hidden layer. In order
to make this happen, we are going to create a brand new function that we
will call step_decay() to implement the following equation:


LearningRate = InitialLearningRate * DropRate^floor(Epoch / EpochDrop)


InitialLearningRate denotes the initial learning rate, which is some
value such as 0.1. DropRate denotes the factor by which we multiply the
learning rate each time that it is changed, Epoch is the number of the
current epoch, and EpochDrop denotes how often we change the learning
rate. You can set this in whichever way you would like, but in this
example we drop it after every 10 epochs.
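The drop-based equation can be checked with a short plain-Python sketch
before it is wired into Keras. The function name step_decay_rate and its
default arguments are assumptions chosen to match the example values in
the text, and epochs are counted from 0 here.

```python
import math


def step_decay_rate(epoch, initial_lr=0.1, drop_rate=0.5, epoch_drop=10):
    """LearningRate = InitialLearningRate * DropRate^floor(Epoch / EpochDrop)."""
    return initial_lr * math.pow(drop_rate, math.floor(epoch / epoch_drop))


# Print the rate at the start and end of each block of ten epochs
for epoch in (0, 9, 10, 19, 20, 29):
    print(epoch, step_decay_rate(epoch))
```

The rate stays constant inside each block of ten epochs and is halved at
every block boundary.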


The learning rate in the SGD class for our example is set to 0, to make
it clear that it is not used. However, you can still set a momentum term
in SGD if you want to use momentum with this learning rate schedule, and
it is fine to set that number where you want. The code that you will need
in order to implement the drop-based learning rate schedule is below:


import math
import numpy
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.optimizers import SGD
from tensorflow.python.keras.layers import Dense
from tensorflow.python.keras.callbacks import LearningRateScheduler


# the learning rate schedule
def step_decay(epoch):
    initial_lrate = 0.1
    drop = 0.5
    epochs_drop = 10.0
    lrate = initial_lrate * math.pow(drop, math.floor((1 + epoch) / epochs_drop))
    return lrate


seed = 7
numpy.random.seed(seed)

# load the dataset
dataframe = read_csv("ionosphere.csv", header=None)
dataset = dataframe.values

# create input and output variables
X = dataset[:, 0:34].astype(float)
Y = dataset[:, 34]
encoder = LabelEncoder()
encoder.fit(Y)
Y = encoder.transform(Y)

# create the model
mod = Sequential()
mod.add(Dense(34, input_dim=34, kernel_initializer='normal', activation='relu'))
mod.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))

# compile the model
sgd = SGD(lr=0.0, momentum=0.9, decay=0.0, nesterov=False)
mod.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])

# the learning schedule callback
lrate = LearningRateScheduler(step_decay)
callbacks_list = [lrate]

# fit the model
mod.fit(X, Y, validation_split=0.33, epochs=30, batch_size=28,
        callbacks=callbacks_list, verbose=2)


For this example, we train for 30 epochs. You can change that number
based on how you want the program to behave. Take some time to type this
code onto your system and see what kind of results you are able to get in
the process.


The Optimizers with Keras 


Each time that a neural network finishes passing one of the batches
through the network and generates its predicted results, it has to decide
what to do with the difference between the obtained results and the true
values. It has to use this difference in such a way that the weights of
the network are adjusted towards the solution that you want. The
procedure that determines how this is done is the optimization
algorithm.


There are actually a few different optimization algorithms that you are
able to work with through Keras, and we have already used a couple of
them, Adam and SGD, in the code we have written so far. Let’s take this a
bit further and explore some of the different things that you are able to
do with these optimization algorithms to get the best results.


First, we will look at SGD. This stands for the Stochastic Gradient
Descent that we talked about before, and it is the classic example of
this kind of algorithm. With SGD, the gradient of the network's loss
function is calculated with respect to every individual weight in the
network. Each forward pass through the network produces a value of the
loss function for the current parameters. The gradient for each weight is
then used to update that weight, and the size of the update is the
gradient multiplied by our chosen learning rate.
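The update rule described above can be sketched on a toy problem. The
example below is plain Python, not Keras: it minimizes the one-weight loss
(w - 3)^2, whose gradient is 2 * (w - 3). The loss, the starting weight,
and the learning rate are all assumptions chosen for illustration, and
because there is a single fixed loss with no random batches, it shows only
the gradient step itself.

```python
def gradient(w):
    """Gradient of the toy loss (w - 3) ** 2 with respect to w."""
    return 2.0 * (w - 3.0)


def gradient_descent(w, learning_rate, steps):
    """Repeatedly move w against the gradient, scaled by the learning rate."""
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w


w_final = gradient_descent(w=0.0, learning_rate=0.1, steps=100)
print(w_final)
```

Each step moves the weight a fraction of the way toward the minimum at
w = 3, and a smaller learning rate makes each of those moves smaller.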


SGD can also be seen as one of the simplest of these algorithms, because
of the behavior and the concepts that come with it. When you give this
algorithm a small learning rate, it simply follows the gradient downhill
on the cost surface, and the weights that we get after each iteration
will generally give a slightly lower cost than the ones before.


Because the SGD algorithm is so simple, it is a good one to use any time
that your network is fairly shallow. However, it is important to note
that this algorithm tends to converge at a slower rate compared to some
of the other algorithms that are available in this library. It also has
the least capability to escape local optima traps on the cost surface.
This is why it is usually not the best choice for deep networks, even
though there are a lot of benefits to using it. Any time that you would
like to use this algorithm, though, you can access it from the following:


keras.optimizers.SGD


We also need to look at how SGD works with Nesterov momentum. Momentum is a
parameter that was developed to make gradient descent algorithms, and SGD in
particular, converge faster. The momentum technique works by bringing in
information from the previous steps to help decide the step that we are
taking right now.


This means that the descent does not rely only on the current gradient; it
also looks back and takes the earlier steps into account.


Momentum comes with several advantages. It helps with a problem that is
common when you use SGD on its own: local minima traps. If a local minimum
is wide enough that every gradient step leads back into it, plain SGD can
get stuck there.


With momentum, however, the learner can carry on past the local minimum and
escape it. Momentum techniques provide another benefit as well: they can
learn more quickly, because they allow you to select larger learning rates.


One way to apply momentum is for the learner to build, at each iteration, a
vector that holds a decaying average of the past steps taken by the
algorithm, and then add it to the current gradient vector. The learner then
moves in the direction of the summed vector.
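As a rough sketch of the idea above, the decaying average of past steps can
be kept in a single "velocity" value. The toy loss (w - 3)^2, the
hyperparameter values, and the function names are assumptions chosen for
illustration; this is one common formulation of classical momentum, not
necessarily the exact one Keras implements.

```python
def loss_gradient(weight):
    """Gradient of the toy loss (w - 3)^2 (an assumption for this sketch)."""
    return 2.0 * (weight - 3.0)


def momentum_step(weight, velocity, gradient, learning_rate=0.1, momentum=0.9):
    """Classical momentum: blend the decayed previous step with the new gradient step."""
    velocity = momentum * velocity - learning_rate * gradient
    return weight + velocity, velocity


w, v = 0.0, 0.0
for _ in range(500):
    w, v = momentum_step(w, v, loss_gradient(w))

print(w)  # oscillates around, then settles very close to, 3.0
```

Because the velocity remembers earlier steps, the weight overshoots the
minimum a few times before settling, which is the same behavior that lets
momentum roll through shallow local minima.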


Nesterov momentum varies this approach a little, and it often gives better
results. It first takes a step in the direction of the decayed average of
the previous steps. From there, it calculates the gradient at the new
position and then performs a correction. The weights are therefore updated
twice in each iteration: first using the momentum, and second using the
gradient.
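The two-part update can be sketched as follows. This is a minimal
illustration on a made-up toy loss, (w - 3)^2; the function names and
hyperparameter values are assumptions, and library implementations often
use an algebraically rearranged form of the same idea.

```python
def loss_gradient(weight):
    """Gradient of the toy loss (w - 3)^2 (an assumption for this sketch)."""
    return 2.0 * (weight - 3.0)


def nesterov_step(weight, velocity, grad_fn, learning_rate=0.1, momentum=0.9):
    """Nesterov momentum: look ahead along the velocity, then correct with the gradient there."""
    lookahead = weight + momentum * velocity   # first move: follow the decayed average
    gradient = grad_fn(lookahead)              # gradient measured at the look-ahead point
    velocity = momentum * velocity - learning_rate * gradient
    return weight + velocity, velocity


w, v = 0.0, 0.0
for _ in range(300):
    w, v = nesterov_step(w, v, loss_gradient)

print(w)  # settles very close to 3.0
```

The only difference from classical momentum is where the gradient is
evaluated: at the look-ahead position rather than at the current weight.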


This is why many programmers prefer Nesterov momentum to the simple
momentum that we talked about before. It uses an additional piece of
information, namely the gradient at the uncorrected point, that is, the
position reached after the momentum step and before the correction. This
gives the algorithm more information to work with when it chooses each step.


One thing to keep in mind is that, by default, SGD in Keras does not use
momentum at all. If you want to use momentum in your code, and Nesterov
momentum in particular, you can configure it with the code below:


keras.optimizers.SGD(momentum = 0.01, nesterov = True) 


The next topic is Adagrad. This is a more advanced technique that performs
gradient descent with a variable learning rate. Weights that have
historically had large gradients are assigned a small learning rate, and
weights that have had small gradients are assigned a larger learning rate.


In effect, Adagrad is SGD with a per-weight learning rate scheduler built
into the algorithm. It improves on SGD by giving each weight a learning
rate based on its own gradient history rather than relying on a single
learning rate for all of the weights. You can access this technique with
the following:


keras.optimizers.Adagrad
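To make the per-weight learning rate concrete, here is a small sketch of an
Adagrad-style update in plain Python. It is not the Keras implementation;
the function name, the constant gradients, and the hyperparameter values
are assumptions chosen so that the effect is easy to see.

```python
import math


def adagrad_step(weights, squared_sums, gradients, learning_rate=0.1, epsilon=1e-8):
    """One Adagrad-style update; returns new weights, new sums, and the per-weight rates used."""
    new_weights, new_sums, rates = [], [], []
    for w, s, g in zip(weights, squared_sums, gradients):
        s = s + g * g                                     # running sum of squared gradients
        rate = learning_rate / (math.sqrt(s) + epsilon)   # larger history -> smaller rate
        new_weights.append(w - rate * g)
        new_sums.append(s)
        rates.append(rate)
    return new_weights, new_sums, rates


weights = [0.0, 0.0]
squared_sums = [0.0, 0.0]
for _ in range(3):
    # The first weight always sees a large gradient, the second a small one.
    weights, squared_sums, rates = adagrad_step(weights, squared_sums, [10.0, 0.1])

print(rates)  # the first rate is much smaller than the second
```

Note that the running sum only ever grows, so every rate keeps shrinking
over time. That is the weakness the next optimizers try to address.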


On a similar note, we move on to Adadelta. This is a variant of Adagrad
that relies on momentum-like techniques to handle the problem of the
monotonically decreasing learning rate in Adagrad. With Adadelta, the
quantity used to scale the update of every weight is a weighted sum of the
current squared gradient and an exponentially decaying average of the past
squared gradients, which in effect covers only a limited number of recent
updates.


Because the denominator used to scale the gradient no longer grows
monotonically, the learning rate no longer shrinks toward zero and becomes
more stable. This makes the algorithm more robust overall.


In its original formulation, Adadelta does not need a learning rate
parameter at all. The Keras version, however, is modified to accept a
learning rate, which keeps it consistent with the other optimization
algorithms. You can access this algorithm in the Keras library with the
following:


keras.optimizers.Adadelta


We can also work with an optimizer known as RMSProp. It is another
correction to Adagrad, proposed independently of Adadelta, and it is
similar to Adadelta. The main idea is that the global learning rate is
divided, for each weight, by an exponentially decaying average of the
squared gradients.


Unless you have more experience with deep learning, it is often best to
leave the hyperparameters of this optimizer at their default settings. You
can access it in the Keras library with the following code:


keras.optimizers.RMSprop
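The RMSProp idea of dividing the learning rate by a decaying average of
squared gradients can be sketched as below. This is an illustration on a
made-up toy loss, (w - 3)^2, not the Keras source; the function names and
the hyperparameter values are assumptions.

```python
import math


def loss_gradient(weight):
    """Gradient of the toy loss (w - 3)^2 (an assumption for this sketch)."""
    return 2.0 * (weight - 3.0)


def rmsprop_step(weight, avg_sq_grad, gradient, learning_rate=0.01, rho=0.9, epsilon=1e-7):
    """RMSProp-style update: scale the step by a decaying average of squared gradients."""
    avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * gradient * gradient
    weight = weight - learning_rate * gradient / (math.sqrt(avg_sq_grad) + epsilon)
    return weight, avg_sq_grad


w, avg = 0.0, 0.0
for _ in range(1000):
    w, avg = rmsprop_step(w, avg, loss_gradient(w))

print(w)  # ends up close to 3.0
```

Unlike the Adagrad sum, the decaying average can shrink again when
gradients become small, so the effective learning rate does not fade away.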


Adam, or Adaptive Moment Estimation, is another optimizer that stores an
exponentially decaying average of the past squared gradients. In addition,
it keeps an exponentially decaying average of the past gradients
themselves, just like momentum.


You can think of Adam as a combination of momentum and the RMSProp
optimizer. Its path resembles that of a ball that has both momentum and
friction. Adam biases the path followed by the algorithm toward flat minima
on the error surface, and it slows learning down when moving along a large
gradient.


Adam is one of the most popular optimization algorithms available in this
library. This is probably because it gives programmers an adaptive learning
rate and momentum behavior at the same time. You can access the Adam
optimizer in Keras with the following:


keras.optimizers.Adam
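A compact sketch of the Adam update shows how the two decaying averages fit
together. It follows the commonly published form of the algorithm with bias
correction; the toy loss (w - 3)^2, the function names, and the learning
rate used in the loop are assumptions for illustration only.

```python
import math


def loss_gradient(weight):
    """Gradient of the toy loss (w - 3)^2 (an assumption for this sketch)."""
    return 2.0 * (weight - 3.0)


def adam_step(weight, m, v, gradient, step,
              learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-7):
    """One Adam update. `step` counts from 1 and is used for bias correction."""
    m = beta1 * m + (1.0 - beta1) * gradient             # decaying average of gradients
    v = beta2 * v + (1.0 - beta2) * gradient * gradient  # decaying average of squared gradients
    m_hat = m / (1.0 - beta1 ** step)                    # correct the bias toward zero
    v_hat = v / (1.0 - beta2 ** step)
    weight = weight - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return weight, m, v


w, m, v = 0.0, 0.0, 0.0
for step in range(1, 501):
    w, m, v = adam_step(w, m, v, loss_gradient(w), step, learning_rate=0.1)

print(w)  # ends up close to 3.0
```

The m term plays the role of momentum and the v term plays the role of the
RMSProp denominator.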


One of the more recent improvements proposed for the Adam optimizer is
AMSGrad. It was found that on some sets of data Adam fails to converge to a
globally optimal solution, even though simpler algorithms, like SGD, do.


It is hypothesized that in some of these sets of data, including those used
in image recognition, most gradients are small and not very informative,
with only the occasional large and more informative gradient. Adam tends to
deprioritize these informative gradients because its exponential averaging
swallows them quickly, which means that the algorithm is likely to steer
past the optimal point without taking the time to explore it properly.


AMSGrad performs well on some sets of data. It has not displaced the Adam
optimizer, at least not yet, because it has not been shown to win
consistently on general-purpose sets of data. You can enable it with:


keras.optimizers.Adam(amsgrad = True)
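The change AMSGrad makes to Adam is small: it divides by the largest
squared-gradient average seen so far instead of the current one, so one
large, informative gradient is not forgotten quickly. The sketch below
omits bias correction for brevity and uses made-up gradients and a short
memory (beta2 = 0.9) so that the effect is visible; the names and values
are assumptions, not the Keras source.

```python
import math


def amsgrad_step(weight, m, v, v_max, gradient,
                 learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-7):
    """AMSGrad-style update: like Adam, but divide by the running maximum of v."""
    m = beta1 * m + (1.0 - beta1) * gradient
    v = beta2 * v + (1.0 - beta2) * gradient * gradient
    v_max = max(v_max, v)                       # never let the denominator shrink
    weight = weight - learning_rate * m / (math.sqrt(v_max) + epsilon)
    return weight, m, v, v_max


w, m, v, v_max = 0.0, 0.0, 0.0, 0.0
history = []
# One large gradient followed by many small ones.
for g in [10.0] + [0.1] * 50:
    w, m, v, v_max = amsgrad_step(w, m, v, v_max, g, beta2=0.9)
    history.append((v, v_max))

print(history[0], history[-1])  # v decays after the first step, v_max does not
```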


The Keras Metrics 


The next thing to look at in the Keras library is metrics. You can list all
of the metrics that you would like to monitor while training your model by
passing a list of function names to the metrics argument of the compile()
method of the model.


The metrics you list can be either string aliases or Keras metric
functions. The values of the metrics are recorded at the end of each epoch
on the training dataset. If a validation dataset is provided, the metrics
are also calculated for the validation data.


These metrics are reported in the verbose output and in the history object
that is returned by the fit() function. The metric function name is
normally used as the key to the metric values. For metrics calculated on
the validation dataset, the val_ prefix is added to the key.


There are several regression metrics available in Keras. The following is a
list of some of the metrics that can be used for regression problems:


1. Mean Squared Error. We are going to see this listed in the code
as mean_squared_error.

2. Mean Absolute Error. We are going to see this listed in the code
as mean_absolute_error.

3. Mean Absolute Percentage Error. We are going to see this listed
in the code as mean_absolute_percentage_error.

4. Cosine Proximity. We are going to see this listed in the code as
cosine_proximity.
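To see what these four metrics measure, it can help to compute them by hand
on a tiny example. The functions below are plain-Python sketches of the
usual textbook definitions, with made-up data; they are not the Keras
implementations, and sign conventions for the cosine metric can differ
between Keras versions.

```python
import math


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_percentage_error(y_true, y_pred):
    # Assumes no true value is zero.
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def cosine_similarity(y_true, y_pred):
    dot = sum(t * p for t, p in zip(y_true, y_pred))
    norm_true = math.sqrt(sum(t * t for t in y_true))
    norm_pred = math.sqrt(sum(p * p for p in y_pred))
    return dot / (norm_true * norm_pred)


y_true = [1.0, 2.0, 4.0]   # made-up targets
y_pred = [1.5, 2.0, 3.0]   # made-up predictions

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
mape = mean_absolute_percentage_error(y_true, y_pred)
cosine = cosine_similarity(y_true, y_pred)
print(mse, mae, mape, cosine)
```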


The best way to see how these regression metrics work is with the code
below.


from numpy import array
from matplotlib import pyplot
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
# A sequence
X = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
# create a model
mod = Sequential()
mod.add(Dense(2, input_dim = 1))
mod.add(Dense(1))
mod.compile(loss = 'mse', optimizer = 'adam', metrics = ['mse', 'mae', 'mape', 'cosine'])
# train the model
history = mod.fit(X, X, epochs = 500, batch_size = len(X), verbose = 2)
# plot the metrics (key names can differ between Keras versions;
# print(history.history.keys()) shows the ones your version uses)
pyplot.plot(history.history['mean_squared_error'])
pyplot.plot(history.history['mean_absolute_error'])
pyplot.plot(history.history['mean_absolute_percentage_error'])
pyplot.plot(history.history['cosine_proximity'])
pyplot.show()


When you run this code, you should see the values of the metrics reported
for each epoch. Note that we specified the metrics by their string aliases,
but they are referenced as keys on the history object by their expanded
function names.


It is also possible to specify the metric functions directly, once they
have been imported in your script. The loss function can also be used as
one of the metrics.


Now that we have looked at the regression metrics, it is time to look at
the classification metrics in the Keras library. These include:


1. Binary Accuracy. We will find this in our code as either acc or
binary_accuracy.

2. Categorical Accuracy. We will find this in our code as
categorical_accuracy.

3. Sparse Categorical Accuracy. This one is going to show up in the
code as sparse_categorical_accuracy.

4. Top k Categorical Accuracy. This one is going to show up in the
code as top_k_categorical_accuracy. You need to make sure that
you are specifying the k parameter when you use this.

5. Sparse Top k Categorical Accuracy. This one is going to show
up in the code as sparse_top_k_categorical_accuracy. You need to
take the time to specify your k parameter.
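Two of these metrics can be sketched by hand to show what they count. The
functions below are simplified plain-Python illustrations with made-up
data, assuming a 0.5 decision threshold and integer class labels; they are
not the Keras implementations.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the 0/1 label."""
    correct = sum(1 for t, p in zip(y_true, y_prob) if (p > threshold) == bool(t))
    return correct / len(y_true)


def sparse_top_k_accuracy(labels, scores, k):
    """Fraction of samples whose true class index is among the k highest scores."""
    hits = 0
    for label, row in zip(labels, scores):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


acc = binary_accuracy([0, 0, 1, 1], [0.1, 0.6, 0.7, 0.4])

labels = [1, 0, 2]                      # true class index for each sample
scores = [[0.5, 0.3, 0.2],              # predicted score for each class
          [0.6, 0.1, 0.3],
          [0.2, 0.5, 0.3]]
top1 = sparse_top_k_accuracy(labels, scores, k=1)
top2 = sparse_top_k_accuracy(labels, scores, k=2)
print(acc, top1, top2)
```

With k = 1 only the single best guess counts, while with k = 2 a prediction
is counted as correct if the true class is among the two best guesses.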


Accuracy is a special metric. The “acc” alias reports accuracy regardless
of the type of classification problem in question. The code below
demonstrates a binary classification problem and the use of the built-in
accuracy metric. Let’s take a look at how this works.


from numpy import array
from matplotlib import pyplot
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
# prepare a sequence
X = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
y = array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
# create a model
mod = Sequential()
mod.add(Dense(2, input_dim = 1))
mod.add(Dense(1, activation = 'sigmoid'))
mod.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['acc'])
# train the model
history = mod.fit(X, y, epochs = 400, batch_size = len(X), verbose = 2)
# plot the metrics
pyplot.plot(history.history['acc'])
pyplot.show()


When you execute this model, it reports the accuracy metric at the end of
each training epoch.


As you can see, there is a lot that you can do with this library. You can
work on many different deep learning algorithms and projects with it, even
some that are beyond this guidebook! Take some time to practice the code
examples provided in this guidebook and see how they work for you.


Chapter 8: The PyTorch Basics You Need to Know to Use This Machine Learning
Library


The last topic that we are going to explore in this guidebook is the
PyTorch library. We have taken some time to look at the other two libraries
and some of the great things that we can do with them. However, that is not
all that is available to you, and PyTorch can make some of your coding
needs a bit easier as well. Let’s take a look at the basics of this
library, along with some of the other things that you need to know to write
deep learning algorithms with it.


The Computational Graphs 


The first thing to look at in the PyTorch library is the idea of
computational graphs. Deep learning is implemented through computational
graphs in most cases. A computational graph is a set of calculations,
called nodes, that are connected in a directional ordering of computation.


This may sound complicated, but it simply means that some of the nodes on
the graph rely on other nodes for their input. Each node therefore needs to
pass on its output at the right time so that the nodes that depend on it
receive the right input.


In these graphs, each node can be treated as an independently working piece
of code. This helps to optimize performance, much as threading and
multiprocessing/parallelism do. Deep learning frameworks, including Theano
and TensorFlow, work by constructing these graphs so that the operations of
the neural network are performed in the proper order.
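A computational graph can be sketched with nothing more than a dictionary.
In this toy version, each node stores a function and the names of the nodes
it takes its input from. The graph layout, node names, and the evaluate
helper are all assumptions made up for illustration; real frameworks build
and run their graphs very differently.

```python
# Each node: name -> (function, list of input node names)
graph = {
    "x": (lambda: 2.0, []),
    "y": (lambda: 3.0, []),
    "sum": (lambda a, b: a + b, ["x", "y"]),
    "out": (lambda s, b: s * b, ["sum", "y"]),
}


def evaluate(graph, name, cache=None):
    """Compute a node after first computing every node it depends on."""
    if cache is None:
        cache = {}
    if name not in cache:
        func, inputs = graph[name]
        args = [evaluate(graph, dep, cache) for dep in inputs]
        cache[name] = func(*args)   # each node is computed only once
    return cache[name]


result = evaluate(graph, "out")
print(result)  # (2.0 + 3.0) * 3.0 = 15.0
```

The "out" node cannot run until "sum" has produced its output, which is the
directional ordering described above.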


Tensors 


We also need to take a moment to look at tensors. While the name may make
it seem like something that belongs only to the TensorFlow library, tensors
matter in PyTorch too. Tensors are data structures similar to the matrices
that we talked about before, and they are a critical component for
efficient computation in deep learning.


Another thing to look at is GPUs, or Graphics Processing Units. They are
very efficient at performing operations between tensors, which is why so
many people who work with deep learning are interested in them.


There are various ways to declare tensors in the PyTorch library. Let’s
take a look at one of them with the code below:


import torch 
x = torch.Tensor(2,4) 


The above code generates a tensor of size 2 x 4, which means that it has
two rows and four columns. Its values are uninitialized, so they can be
arbitrary. We can display the tensor by adding “print(x)” to the end of the
code above.


In addition to this, we can create a tensor of random float values using
the code “x = torch.rand(2, 4)”. It is also possible to perform
mathematical operations on these tensors. A good example is in the code
below:


x = torch.ones(2,4)
y = torch.ones(2,4) * 2
x + y


The Autograd in the PyTorch Library 


Most deep learning libraries provide a mechanism that calculates the error
gradients for you and propagates them backward through the computational
graph that we talked about earlier. The PyTorch library provides such a
mechanism under the name autograd. It is easy to access and fairly
intuitive to use.


The main component of this system is the Variable class. You can import it
whenever you need it with the code “from torch.autograd import Variable”.
Note that since PyTorch 0.4, Variable has been merged into the tensor
class, so a tensor created with requires_grad=True can be used directly.


You may at some point want to write code that needs a few variables of its
own. The library will not create them for you, so you need to create them
yourself. The code to create a variable looks like this:


var_x = Variable(torch.randn((4, 3)))
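To give a feel for what an automatic gradient mechanism does behind the
scenes, here is a deliberately tiny sketch in plain Python. The Value class
and its methods are invented for this illustration and only support
addition and multiplication of single numbers; this is not how PyTorch is
implemented.

```python
class Value:
    """A number that remembers how it was computed so gradients can flow back."""

    def __init__(self, data, parents=(), grad_fns=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents      # the values this one was computed from
        self._grad_fns = grad_fns    # how to pass a gradient back to each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other),
                     (lambda g: g, lambda g: g))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (lambda g: g * other.data, lambda g: g * self.data))

    def backward(self, upstream=1.0):
        """Accumulate the incoming gradient, then pass it on to the parents (chain rule)."""
        self.grad += upstream
        for parent, grad_fn in zip(self._parents, self._grad_fns):
            parent.backward(grad_fn(upstream))


x = Value(2.0)
y = Value(3.0)
z = x * y + x        # z = x*y + x, so dz/dx = y + 1 and dz/dy = x
z.backward()
print(z.data, x.grad, y.grad)
```

Every operation records its inputs, which builds a small computational
graph, and backward() walks that graph in reverse to collect the gradients.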


How to Build Up a Neural Network 


Remember that we spent a bit of time in an earlier chapter looking at
neural networks and exploring some of the things that we can do with them.
Now it is time to take this further and see what steps are needed to build
our own neural networks in this library.


That is one of the neat things about the PyTorch library: it is possible to
write your own neural networks, and it is easier than you may think.


We will look at PyTorch through a practical lens. Learning the theory is
good, but it is not that useful if you are not able to put it into
practice. The PyTorch implementation of this kind of neural network looks
much like a NumPy implementation, and part of the goal of this section is
to show you how the two are the same and how they are different.


To keep this relatively simple while still covering the process, we are
going to create a network with three layers: five nodes in the input layer,
three in the hidden layer, and one in the output layer. We will work with
only one training example to keep things fast, but you can add as many as
you want.


With this in mind, let’s look at the code that you need to write, and then
we will explain what each part is doing.


import torch 
n_input, n_hidden, n_output = 5, 3, 1 


The first thing to do is the parameter initialization. Here, the weights
and the biases of each layer are initialized as tensor variables. Remember
that tensors are the base data structures of the library, used for building
different types of neural networks. They can be considered a generalization
of matrices and arrays; in other words, tensors are N-dimensional matrices.
The code for this step is:


# initialize tensor for inputs and outputs 
x = torch.randn((1, n_input)) 
y = torch.randn((1, n_output)) 


# initialize tensor variables for weights 
W1 = torch.randn(n_input, n_hidden) # weight for hidden layer 
W2 = torch.randn(n_hidden, n_output) # weight for output layer 


# initialize tensor variables for bias terms 
B1 = torch.randn((1, n_hidden)) # bias for hidden layer 
B2 = torch.randn((1, n_output)) # bias for output layer 


After the parameter initialization step, the neural network can be defined
and trained in four key steps:


1. Forward propagation
2. Loss computation
3. Backpropagation
4. Updating the parameters


Let’s take a look at these steps in some more detail to understand how they
work.


First comes forward propagation. In this step we calculate the activations
at every layer using the two steps listed below. These activations flow in
the forward direction, from the input layer to the output layer, to produce
the final output. The two steps are:


1. z = weight * input + bias
2. a = activation_function(z)
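The two steps above can be sketched in plain Python for the 5-3-1 network
described earlier. The input values, the weights, and the choice of a
sigmoid activation are all assumptions made up for this illustration; in
PyTorch the same computation would be done with tensor operations.

```python
import math

n_input, n_hidden, n_output = 5, 3, 1


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense_forward(inputs, weights, biases):
    """Apply z = weight * input + bias and a = sigmoid(z) for every unit in a layer.

    weights[i][j] connects input i to unit j.
    """
    activations = []
    for j in range(len(biases)):
        z = sum(inputs[i] * weights[i][j] for i in range(len(inputs))) + biases[j]
        activations.append(sigmoid(z))
    return activations


# Made-up example values (not taken from the text above).
x = [0.5, -1.0, 0.25, 2.0, -0.75]
W1 = [[0.1 * (i - j) for j in range(n_hidden)] for i in range(n_input)]
b1 = [0.0] * n_hidden
W2 = [[0.3], [-0.2], [0.5]]
b2 = [0.0] * n_output

hidden = dense_forward(x, W1, b1)       # input layer -> hidden layer
output = dense_forward(hidden, W2, b2)  # hidden layer -> output layer
print(hidden, output)
```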


Then we move on to the loss computation. In this second step, the error,
which is called the loss, is calculated at the output layer. A simple loss
function measures the difference between the actual value and the predicted
value. Later, we are going to look at the different loss functions that are
available in the PyTorch library and how they work.


From here, we move on to backpropagation. The aim of this third step is to
minimize the error at the output. To do this, we make marginal changes to
the weights and the biases. These marginal changes are computed using the
derivatives of the error term.


Using a principle from calculus known as the chain rule, the delta changes
are passed back to the hidden layers, where the corresponding changes are
made to their weights and biases. In short, this process works out the
adjustments needed in the weights and the biases so that the error is
minimized as much as possible.


Finally, we update the parameters. In this step, the weights and the biases
are updated using the delta changes that we obtained in the third step.


Once the four steps above have been executed for the number of epochs that
you want, over many training examples, the loss of the neural network
should be reduced to a small value. You then have the final weight and bias
values, which you can use to make predictions, even on data that the
network has not seen before.
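Putting the four steps together, a complete training loop for the 5-3-1
network can be sketched in plain Python as below. Everything specific here
is an assumption for illustration: the data values, the initial weights,
the sigmoid activation, the squared-error loss, the learning rate, and the
number of epochs. It is a hand-written sketch, not PyTorch code.

```python
import math

n_input, n_hidden = 5, 3


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# One made-up training example and made-up starting parameters.
x = [0.5, -1.0, 0.25, 2.0, -0.75]
y = 0.8
W1 = [[0.1 * (i - j) for j in range(n_hidden)] for i in range(n_input)]
b1 = [0.0] * n_hidden
W2 = [0.3, -0.2, 0.5]
b2 = 0.0
learning_rate = 0.5

losses = []
for epoch in range(500):
    # 1. Forward propagation
    hidden = [sigmoid(sum(x[i] * W1[i][j] for i in range(n_input)) + b1[j])
              for j in range(n_hidden)]
    output = sigmoid(sum(hidden[j] * W2[j] for j in range(n_hidden)) + b2)

    # 2. Loss computation (squared error)
    loss = (output - y) ** 2
    losses.append(loss)

    # 3. Backpropagation (chain rule, from the output back to the hidden layer)
    d_output = 2.0 * (output - y) * output * (1.0 - output)
    d_hidden = [d_output * W2[j] * hidden[j] * (1.0 - hidden[j])
                for j in range(n_hidden)]

    # 4. Updating the parameters
    for j in range(n_hidden):
        W2[j] -= learning_rate * d_output * hidden[j]
        b1[j] -= learning_rate * d_hidden[j]
        for i in range(n_input):
            W1[i][j] -= learning_rate * d_hidden[j] * x[i]
    b2 -= learning_rate * d_output

print(losses[0], losses[-1])  # the loss shrinks as training proceeds
```

With a single training example the network simply memorizes it. With many
examples, the same loop would be run over all of them in each epoch.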


PyTorch may be the last library that we discuss in this guidebook, but that
doesn’t mean it is less important than the others. It is especially useful
for creating the different neural networks that you need in deep learning,
so it is one of the libraries that you should spend some time on.


Conclusion 


Thank you for making it through to the end of Deep Learning with Python!
Let’s hope it was informative and able to provide you with all of the tools
you need to achieve your goals, whatever they may be.


The next step is to get a better look at some of the different things that
you can do with deep learning and explore all of the algorithms, libraries,
and other tools that you are able to use for your programming needs. This
guidebook took some time to explore a lot of the different things that you
are able to do when it is time to code in deep learning, especially with
the help of some Python libraries, and now you are on your way to building
some of your own deep learning models.


Inside this guidebook, we spent our time exploring what deep learning is
all about, as well as why the Python language is the perfect addition to
make all of this happen. Sure, you can do a lot with machine learning, and
even with deep learning, with other coding languages, but many programmers
like the flexibility, the ease, and all the diverse libraries that come
with the Python language and choose to work with that instead.


This guidebook went into more detail about all of this and paid particular
attention to what you will be able to do when it comes to deep learning
while using some of the best-known Python libraries. Whether you want to
work with TensorFlow, Keras, or PyTorch, you are sure to find some of the
things that you need, and before you know it, your coding will be done!


When you are ready to learn more about deep learning and all that you can 
do with this coding language, make sure to check out this guidebook to help 
you to get started. 


Finally, if you found this book useful in any way, a review on Amazon is 
always appreciated! 
Introduction


Congratulations on downloading Python for Data Analysis: The Crash Course
for Beginners to Learn the Basics of Data Analysis with Python Database
Management and Programming with Pandas, NumPy and IPython.


If you are interested in this book, you probably already have a general
idea of Python programming and why it is crucial to learn to program in
this language. If you don’t know what Python is and you are interested in
learning a programming language, you will probably be persuaded to use
Python after you read this book and go through the examples it provides.


This book aims to provide beginners with the basic tools of Python, along
with intensive courses that develop the skills needed to use the
fundamental, widely used libraries for data analysis in Python. This book
does not require any prerequisite skills of any kind.


The book has seven main chapters. The first chapter provides an overview of
the Python programming language. The second chapter presents how to
install, set up and use Python on any operating system. The third chapter
is an introduction to programming in the Python language. It presents the
fundamentals and basic skills that any beginner needs to know to be able to
use Python for programming. Chapter 4 presents a course on how to use
IPython, which is an interactive programming environment developed
specifically for Python. Chapter 5 is an intensive course that focuses on
the NumPy package and its functionalities in Python. The NumPy package is a
library for numerical programming in Python. Chapter 6 is an intensive
course on how to use the Pandas package. This particular package is used
widely for data analysis and is a fundamental library to master for anyone
who is interested in data analysis. Chapter 7 is an intensive course on the
matplotlib library and provides examples of how to plot and visualize data.
All chapters are based on examples using real data in order to practice
programming with Python.


Plenty of books on this subject exist, so thank you for your interest in
reading this one. Every effort was made to make sure it has valuable
information for beginners. Please enjoy!
The Python language is a widely used programming language for developing a
wide range of applications. This chapter provides an introduction to the
Python language. We will cover the basics of Python as well as the
characteristics of this programming language. We will explain in detail how
to install Python in different operating system environments and give a
broad view of the applications that can be developed with Python. First,
let’s start by exploring what Python is and the characteristics of this
programming language.


1 Introduction to Python


Python belongs to the category of programming languages that are
interpreted, object-oriented and high-level. The language has dynamic
semantics and is mostly considered a scripting language. This type of
programming language does not require the code to be compiled into a
machine-readable format before run time. Python was initially developed in
the late 1980s and was first used for small programs. It has evolved as an
open source project supported by a community. Python is now widely used to
develop large commercial applications, and not only basic programs.


Because Python is a high-level language with built-in data structures,
dynamic typing and dynamic binding, it is a very attractive language for
developing applications quickly. These features also make it useful as a
glue language that binds existing components together.


2 Why use Python for programming 


The main question beginners ask is why they should use Python in
particular when plenty of programming languages are available. Python is
widely used these days, and there are many reasons that make it an
attractive programming language. However, keep in mind that the choice of a
programming language for a specific application depends on the constraints
and requirements of the application, as well as on the personal preference
and expertise of the developer. In this section, we will go through the
characteristics of the Python programming language and what makes it such
an attractive and widely used tool.


First of all, Python is a simple language and easy to learn. Python was
designed primarily to be a readable language with an easy syntax. Users
can develop, read, translate and maintain Python code more easily than
code in many other scripting languages. Therefore, the cost of developing
and maintaining code in Python is reduced. The reason is that readable code
allows easy collaboration between team members, with fewer language or
expertise obstacles.


Python enhances the productivity of developers. In fact, Python is
considered a very productive coding environment compared to languages such
as Java, C and C++, which are compiled and generally execute faster. A
Python script is often around one-third to one-fifth of the size of similar
code written in Java or C++. This means that programming in Python requires
less typing and debugging, and leaves less code to support and maintain.
Python programs do not need to be compiled or linked before they are run;
running a Python program is direct and straightforward. Moreover, a Python
program is easy to organize. In short, when you develop Python code, you
can run it immediately and get instant feedback to detect errors and debug
it. Hence, the cycle of writing, running and correcting code is quicker
than in languages that require a compile step.


Python is a portable language. This means that a program coded in Python
can generally be run without changes on different computers, systems and
platforms. The same Python code can be run on Windows, UNIX or Mac with no
change. Python programs can also be run on large servers and on Android or
iOS tablets; Python is available on most platforms. As for graphical user
interfaces, Python supports options for developing a portable graphical
user interface.


Python supports dynamic declaration of variables. Dynamic declaration, or
dynamic typing, means that there is no need to declare a variable before
its use. A variable can be used directly, without first telling the
computer what type of value it should hold. So, dynamic typing makes it
easier to develop a program.


Another attractive aspect of the Python language is the support libraries
available for it. Indeed, Python comes with a support library that contains
a wide variety of prebuilt, cross-platform functionalities and modules.
This library is called the standard library. It offers high-level tools to
process data and support basic tasks, including matching text patterns,
searching for values in an object, basic mathematical operations, network
scripting and more. Other libraries developed by third parties are
available and can be included easily within the Python programming
environment. So, when you are developing code in Python, you are not coding
from scratch; instead, you can use the pre-built functions and libraries
developed by the community that supports Python.
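
As a small illustration of the kinds of tasks listed above, the following
sketch uses only modules from the standard library. The sample sentence and
the list of values are made up for the example and are not taken from any
real data set.

```python
import math
import re

# Matching text patterns: find every word that starts with a capital letter.
sentence = "Python comes with a Standard Library"  # made-up sample text
capitalized = re.findall(r"\b[A-Z][a-z]+", sentence)
print(capitalized)

# Searching for a value in an object: the `in` operator and list.index().
values = [3, 8, 15, 42]  # made-up sample data
print(15 in values, values.index(15))

# Basic mathematical operations from the math module.
print(math.sqrt(16), math.factorial(5))
```

Nothing has to be installed for this to run: re and math are available as
soon as Python itself is installed.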


The Python language supports component integration. In other words, Python
can be integrated with other pre-developed code or applications. Python
can be called from other languages, particularly C or C++. Python can also
import and use libraries developed in C and C++.


To summarize, when Python is used for programming and developing
applications, you benefit from a fast development cycle, program
portability, easy readability, code organization, integrated components
and ready-made libraries. All these aspects and characteristics of Python
make it an attractive tool compared to other programming languages. Now
you might be wondering what the strengths and technical characteristics of
Python are. That is what we will answer in the next section.


3 The strengths of Python language 


In this section we will go into the details of the qualities of the Python
programming language which make it an attractive programming tool. Python,
as we described it in the introduction section, is an object-oriented
programming language. Therefore, it benefits from the strengths of the
notions of polymorphism, multiple inheritance and operator overloading.
These notions are easier to apply and understand with Python than with
many other object-oriented programming languages. Polymorphism means that
a function can be used with values of different forms and types. For
example, a function can take as an input a number or a string. Inheritance,
on the other hand, is a mechanism that makes it possible to define a
class, called a child class, based on an existing class, called the parent
class, by adding new attributes to those of the parent class. By doing so,
the child class inherits all attributes and methods that are defined for
the parent class. A class is a major notion in object-oriented programming:
it provides a tool to create and manipulate an object by assigning to this
object attributes, and methods that process the object and its attributes.
Multiple inheritance enables a child class to inherit attributes and
methods from several classes at the same time. Overloading is a
characteristic that allows an operator, a function or a method to behave in
different ways according to the values that are fed to it. Overloading
makes it possible to reuse the same code, instead of developing similar
code several times, in order to perform tasks according to the type of the
variable being handled.
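
The notions described above can be sketched in a few lines. The class names
(Animal, Swimmer, Duck, Vector) and the function double() are hypothetical
and exist only for this illustration; this is one possible way to show the
ideas, not the only one.

```python
class Animal:
    """Parent class: every animal has a name and can describe itself."""

    def __init__(self, name):
        self.name = name

    def describe(self):
        return f"{self.name} is an animal"


class Swimmer:
    """A second parent class, used to illustrate multiple inheritance."""

    def swim(self):
        return f"{self.name} swims"


class Duck(Animal, Swimmer):
    """Child class that inherits attributes and methods from both parents."""

    def describe(self):
        # Redefines the parent method: same call, different behaviour.
        return f"{self.name} is a duck"


class Vector:
    """Operator overloading: `+` is redefined for Vector objects."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __add__(self, other):
        return Vector(self.x + other.x, self.y + other.y)


def double(value):
    # Polymorphism: the same function works on numbers and on strings.
    return value * 2


duck = Duck("Donald")
print(duck.describe(), "/", duck.swim())
total = Vector(1, 2) + Vector(3, 4)
print(total.x, total.y)
print(double(4), double("ab"))
```

Duck never defines __init__() or swim() itself; it gets them from Animal
and Swimmer respectively, which is what inheritance means in practice.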


Python is available to anybody who wants to use it, as it is a free
programming language. Python is also open source: it can be downloaded for
free and embedded in any system without any cost. Although it is free,
Python is supported by a community. In addition to the source code,
different libraries developed by third parties and by the support
community are available to use for free. These libraries offer different
built-in tools, methods and functions, which give a base for developing
programs instead of starting from scratch and developing long scripts. You
can also develop your own libraries or packages. This brings us to the next
strength of Python.


The Python language supports the use of modules and libraries. A module or
library is a set of different functions, methods or scripts. The fact that
Python programs can be developed in a modular way means that the same code
can be re-used and applied to a wide variety of applications. When you
develop a module or a library that you need, this same library can be
reused and extended for other applications. It is very easy to import and
load any module that you have developed into your working environment.


Python is considered a powerful hybrid programming language. In fact, it
mixes the features of scripting languages with the features of advanced
languages that require compiling. This hybrid aspect of Python makes it a
powerful tool to develop large-scale applications. Python offers dynamic
typing, which means there is no need to declare the type or size of
variables. Python also offers automatic memory management: memory is
allocated automatically, and Python automatically frees the resources that
are no longer used. In short, Python is an interactive programming language
that includes classes, modules, exceptions and high-level dynamic data
objects of different types. These tools enable the programmer to easily
organize the code into several components and to benefit from
object-oriented programming in order to reuse and customize the code for
other applications. Exceptions also make it easy to manage events and
errors. Python also offers the conventional data structures as built-in
objects, such as dictionaries and lists, which we will explain in detail in
chapter 3. These objects are simple to use and flexible. They can expand
and shrink depending on their use during the execution of the program. They
can also be nested in larger, more complex objects to process and describe
complex information. Python comes with powerful and useful packages and
libraries that offer the tools for basic and standard operations, including
mathematical functions, concatenation, mapping and sorting. The available
Python libraries also support machine learning, data mining, numerical
programming and much more.
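
To make the remarks about built-in objects more concrete, here is a minimal
sketch of a list and a dictionary that grow and shrink while the program
runs, and that are then nested inside a larger object. All names and values
(fruits, stock, order) are invented for the example.

```python
# A list grows and shrinks as items are added or removed.
fruits = ["apple", "banana"]
fruits.append("cherry")   # grows to three items
fruits.remove("apple")    # shrinks back to two items
print(fruits)

# A dictionary maps keys to values and can also change size at run time.
stock = {"banana": 12}
stock["cherry"] = 30      # add a new key
del stock["banana"]       # remove an existing key
print(stock)

# Built-in objects can be nested to describe more complex information.
order = {"customer": "Alice", "items": fruits, "quantities": stock}
print(order["items"][0], order["quantities"]["cherry"])

# Standard operations: concatenation, sorting and mapping.
numbers = [3, 1, 2] + [5, 4]
squares = list(map(lambda n: n * n, sorted(numbers)))
print(squares)
```

No size or type was declared anywhere: the objects simply take the shape of
the data that is put into them.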


Another technical strength of programming with Python is how simple it is
to use Python code together with other programming languages. In other
words, Python code can be glued to code components developed in another
language. This means that we can add Python functionalities to a program
written in another language, for example in an end-user application.
Conversely, we can use within a Python program packages that are developed
in C or C++, for example. Finally, Python is easy to use as a programming
language because it requires no compiling. Python code is executed
immediately, which supports iterative programming and rapid error detection
and maintenance. Python is an easy language to learn. If you have no
background in programming, Python is a good choice for learning a first
programming language, and you should be able to develop programs in
relatively few days. If you already have a background in programming, you
should be able to develop programs even faster and without support. Now
that you know the attractive features of Python, let’s explore what kind of
applications you can develop with Python.


4 What programs can be developed with Python 


Besides offering powerful tools, Python allows developing a wide variety of
real-world applications. In fact, Python is used in different domains. It
is used to script components that are integrated in other programs, or to
develop stand-alone programs written fully in Python. Python has a very
wide range of potential uses. We will present in this book the basic
applications that can be developed with Python, but keep in mind that you
can develop almost anything with it.


Python can be used to develop system programs, commonly called shell
scripts. Because Python is portable, it is a suitable tool to write the
scripts and tools that are generally used to maintain and administer
systems. These scripts and programs can usually be used on any operating
system with few or no changes. So, with Python you can develop scripts that
fetch file and directory paths or run other programs. You can also write
Python programs that perform parallel processing and so on.


In fact, the standard library of Python includes POSIX bindings. These
bindings support all the common operating system tools: environment
variables, files, pattern matching, command line arguments, filename
expansion, processes, sockets and much more. All these tools can be
manipulated in Python to develop system programs that copy files or
directories, or list the files of a directory.
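
A short sketch of such a system script is given below. It reads an
environment variable, creates a file, copies it and lists the directory,
using the standard modules os, shutil and tempfile. The file names
notes.txt and notes_copy.txt are hypothetical, and the work is done in a
temporary directory so that nothing on your disk is modified.

```python
import os
import shutil
import tempfile

# Read an environment variable. PATH exists on the common platforms; the
# fallback string is only used if it is missing.
path_value = os.environ.get("PATH", "<PATH not set>")
print("PATH starts with:", path_value[:40])

# Work inside a temporary directory that is deleted at the end of the block.
with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "notes.txt")  # hypothetical file name
    with open(source, "w") as handle:
        handle.write("hello\n")

    # Copy the file, then list the content of the directory.
    shutil.copy(source, os.path.join(workdir, "notes_copy.txt"))
    listing = sorted(os.listdir(workdir))
    print(listing)
```

The same script should behave the same way on Windows, Linux and Mac,
because os.path.join() builds the paths with the separator of the current
system.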


Python can be used to develop graphical user interfaces (GUIs). Python has
a built-in package called Tkinter which supports GUI programming. This
library is a standard object-oriented interface that enables developing and
implementing portable graphical user interfaces that can be run on the
major operating systems (Windows, UNIX/Linux, Mac) without any changes.


The standard library of Python includes standard Internet modules which
support internet scripting. By internet scripting we mean developing
applications and programs that achieve a wide range of networking jobs on
both the server and the client side. Python programs can perform tasks such
as communicating over sockets, getting information from a server, sending
files via FTP, processing XML files, handling email (i.e. sending,
receiving and parsing), and fetching and searching pages on the internet
through URLs, among others. In addition to internet scripting, Python also
includes libraries for Internet programming that allow generating HTML
files or developing web sites in a relatively simple way.
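
Two of the tasks mentioned above, working with URLs and parsing email, can
be tried without any network connection. The sketch below uses the standard
modules urllib.parse and email; the web address and the email addresses are
placeholders on the reserved example.com domain, not real ones.

```python
from email import message_from_string
from email.message import EmailMessage
from urllib.parse import parse_qs, urlparse

# Split a URL into its components (the address is a made-up example).
url = urlparse("https://www.example.com/search?q=python&page=2")
query = parse_qs(url.query)
print(url.netloc, url.path, query)

# Build an email message, then parse it back from its text form.
message = EmailMessage()
message["From"] = "alice@example.com"
message["To"] = "bob@example.com"
message["Subject"] = "Hello"
message.set_content("This is a test message.")

parsed = message_from_string(message.as_string())
subject = parsed["Subject"]
body = parsed.get_payload().strip()
print(subject, "->", body)
```

Actually sending the message or downloading a page would require a server
and a connection, which is why this sketch stops at building and parsing.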


We mentioned earlier that one of the strengths of Python is its ability to
be integrated within applications developed in other programming
languages. Therefore, Python is a useful glue language for testing other
applications and their components. For example, a library developed in C or
C++ can be imported into a Python environment, then tested and run using
Python scripts.

Python programming allows prototyping. In fact, a Python program does not
distinguish between components that are developed in Python and those
developed in C, for example. This characteristic makes it possible to
develop a prototype in Python and then move the prototype, or parts of it,
to a compiled language such as C or C++ for faster execution.

Python allows numerical programming through the NumPy library. We will
provide an in-depth discussion of this library in Chapter 5 of this book.
This library provides objects and data structures that make Python a
high-performance and efficient tool for statistical, mathematical and
engineering computations. Python also allows database programming. It
includes tools that support saving Python data structures to files and
reading them back. Interfaces to traditional database management systems
such as MySQL and Oracle are also available. You can also do game
programming, graphics, image processing and much more with Python.
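
As an illustration of saving data structures and of database programming,
the sketch below first stores a dictionary in a file with the standard
pickle module and reads it back, then stores the same data in a database
table. Note that it uses sqlite3, which ships with Python and needs no
server, as a stand-in for the database systems named above; the dictionary
inventory and the table stock are made up for the example.

```python
import os
import pickle
import sqlite3
import tempfile

# Save a Python data structure to a file and read it back with pickle.
inventory = {"banana": 12, "cherry": 30}  # made-up sample data
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "inventory.pkl")
    with open(path, "wb") as handle:
        pickle.dump(inventory, handle)
    with open(path, "rb") as handle:
        restored = pickle.load(handle)
print(restored == inventory)

# Store the same data in a relational database. sqlite3 is used here only
# because it needs no server; it is not one of the systems named in the text.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE stock (name TEXT, quantity INTEGER)")
connection.executemany("INSERT INTO stock VALUES (?, ?)", inventory.items())
rows = connection.execute(
    "SELECT name, quantity FROM stock ORDER BY quantity DESC"
).fetchall()
connection.close()
print(rows)
```

Working with MySQL or Oracle follows the same pattern of connecting,
executing statements and fetching rows, but requires a separate driver
package and a running database server.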


Chapter 2: Installing and setting up Python 





In this chapter, we will explain how you can get started with Python. We
will go into the details of how to install Python and how to use it in a
command line or in an interactive interface.


1 How to start with Python


In the previous chapter we introduced Python as a language that is used for
programming a wide range of applications. In practice, Python comes in the
form of a software package. This software includes an interpreter, which
translates the code you develop into instructions that the hardware is able
to process and execute. In other words, the interpreter is a layer of
software that sits between your code and the machine hardware.

Once you install Python on your computer, you get several components,
including the interpreter and the standard library that comes with it. The
Python interpreter comes as an executable. The installation of Python
depends on the operating system you use. We will see that in detail in the
next section.


2 Installing Python


Before jumping into installing Python, you need to check whether Python is
already installed on your machine. In fact, Python might already be
installed if you are using a Linux or UNIX system. You can type ‘python’ in
a shell prompt. If Python is not available, the shell returns an error; if
it is available, it displays a ‘>>>’ prompt, which means it is ready for
you to type in Python code. You can also check whether Python is available
in a Linux environment by searching for a Python executable in /usr/bin or
/usr/local/bin. On Windows, you can check whether Python is available by
searching for Python in the Start menu. Make sure to have the latest
version of Python.
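
If Python is already installed, you can also ask Python itself which
version is running and where the interpreter is located. The short sketch
below relies only on the standard modules sys and shutil; the command names
python and python3 are the usual ones, but the names available on your
machine may differ.

```python
import shutil
import sys

# Version of the interpreter that is running this script.
version = sys.version_info
print(f"Running Python {version.major}.{version.minor}.{version.micro}")

# Full path of the interpreter executable.
print("Interpreter:", sys.executable)

# Look for a `python` or `python3` command on the PATH, the same lookup the
# shell does when you type the command. None means it was not found.
for command in ("python", "python3"):
    print(command, "->", shutil.which(command))
```

Comparing the version printed here with the one offered on the download
page tells you whether an update is worthwhile.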


You can download and install Python from the official Python web site,
www.python.org. When downloading Python from the website, make sure you get
the appropriate version for your operating system. On Windows, you can
download the appropriate installer and run the executable. On Linux, Python
may be provided as rpm packages that you install with the rpm tool.

If you are using the Windows operating system, Python comes in the form of
a self-installer. You can run that executable and accept the default
choice in every window to install Python with the default settings. The
default installation contains the Python documentation, support for
graphical user interfaces, the IDLE development environment and all the
settings that you need. Once installed, Python will appear among the
programs of the Start menu.


If you are using Linux, Python may come as one or more rpm packages that
you need to install. On Linux and UNIX, Python can also be compiled from
the source code by unpacking the source archive and running the configure
script followed by the make command. This configures and builds Python
automatically. For more details, Python comes with a README file that
provides installation instructions.


3 How to use Python in command line 


One way to start using Python is to launch it from the prompt of your
operating system. On Windows, you can open a command prompt (DOS console)
window and type ‘python’. When Python starts, it displays a few lines of
information, the first of which gives the Python version in use. Below is
the output you get when you start Python in a prompt:
C:\Users\***>python 
Python 3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit 
(AMD64)] :: Anaconda, Inc. on win32 
Type "help", "copyright", "credits" or "license" for more 
information. 
>>> 
Once you launch a Python session, it displays the >>> prompt, which means
that it is ready to run the instructions you type in. Let’s run an example
and print, for instance, ‘This is my first command in Python’:
>>> print('This is my first command in Python')
This is my first command in Python
>>>
When you run a Python command in a shell as in the example above, the
result is displayed on the line after the command, followed by a new >>>
prompt. In this example we ran the code in an interactive session. To exit
the interactive session, you can type Ctrl-Z followed by Enter on a Windows
machine, or Ctrl-D on UNIX or Linux systems.


Note that when you run code in an interactive session like we did, the
code is executed instantly and not saved anywhere. It is a good way to
start experimenting and testing code. To keep Python code, you can save it
in a file that has ‘.py’ as an extension. These files are called modules.
You can develop scripts in Python using a text editor like Notepad++. To
execute a module, you need to type ‘python filename.py’ in a shell prompt.
We will see in the next chapter how to develop modules and how to execute
them.

Note also that when you run Python in a shell prompt, the output is
displayed instantly and is not saved in any file either. Once you exit the
prompt, the output is lost. In order to save the output in a file, you need
to use a shell command that redirects the output, in this case to a text
file, as follows: python filename.py > output.txt
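
The same redirection can be reproduced from within Python, which may help
to see what the shell does behind the scenes. The sketch below writes a
one-line module to a temporary directory, runs it with the current
interpreter through the standard subprocess module, and sends its output to
a text file. The names filename.py and output.txt simply mirror the command
above, and the content of the script is invented for the example.

```python
import os
import subprocess
import sys
import tempfile

# Content of a tiny module; in practice you would write it in a text editor.
script_source = 'print("This output comes from a script")\n'

with tempfile.TemporaryDirectory() as workdir:
    script_path = os.path.join(workdir, "filename.py")
    output_path = os.path.join(workdir, "output.txt")
    with open(script_path, "w") as handle:
        handle.write(script_source)

    # Equivalent of typing: python filename.py > output.txt
    with open(output_path, "w") as output_file:
        subprocess.run(
            [sys.executable, script_path], stdout=output_file, check=True
        )

    # Read the saved output back to confirm that it was written to the file.
    with open(output_path) as handle:
        saved_output = handle.read()

print(saved_output)
```

For everyday use the shell command is simpler; this version is only meant
to show that the output of a script is ordinary text that can be captured
and stored.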

There are other tools that are more convenient for interactive programming
and for saving the working environment. We will present them in chapter 4
of this book.


Chapter 3: Programming with Python and Python Libraries


In this chapter, we will introduce some basic Python syntax as well as some
useful libraries that can be used in Python. But first, let’s explore the
types of objects that are available and can be manipulated in Python.


3.1 Basic programming with Python 


Python has several types of built-in objects, and functions to manipulate
these objects. Data is typically stored in objects. These objects are
manipulated in statements or expressions that set the rules or commands to
apply to an object. A set of these statements is called a module, and
modules form a program. The statements and expressions create and process
the Python objects that we will discuss below. These objects are what is
used to manipulate the data.


In Python we can distinguish six different built-in objects, namely
numbers, strings, lists, dictionaries, tuples and files. A number object
can be an integer, a float or a complex number. Strings are basically a
chain of characters. Lists are objects that can contain a collection of
other objects: numbers, strings or other lists. Dictionaries can also
contain a collection of other objects: numbers, strings, lists or other
dictionaries. Lists and dictionaries can both be indexed and iterated
through. However, the difference between lists and dictionaries is how the
items are stored and how they are fetched. In a list, objects are ordered
and are accessed by position. In a dictionary, objects are stored and
accessed by key. Tuples are similar to lists: their objects are ordered and
accessed by position, but a tuple cannot be modified once it is created.
Finally, Python supports creating and reading file objects. All these
objects can be manipulated and processed by built-in functions that Python
offers. We will go through the details of each Python object in the
following sections and see how we can declare objects in Python.


Data structure or object declaration in Python 


Remember that Python does not require any declaration of a variable, its
size or its type. A variable is created once a value is assigned to it. We
can also change the type of a variable by assigning another value to it.
Let’s create, for example, a number variable A:
>>> A = 5
>>> print(A)
5
Now we can change the type of the variable A to a string by assigning a
string value to it, for example:
>>> A = 'Hello World !'
>>> print(A)
Hello World !
In the previous example, we assigned to the variable A a number value, then
a string value. Therefore, we have changed the type of the variable after
it was created. The function print() we used in the example above, as you
have probably guessed, shows the output of a statement. To verify the type
of any object in Python we can simply call the function type(). This
function takes an object as input and returns the type of the object. So,
if we apply this function to our variable A, which now holds a string, and
print the output, we get:
>>> print(type(A))
<class 'str'>
Note that we can also declare several variables of different types in a
single statement as follows:
>>> A, B, C = 100, 'Banana', 10.5
Now we can check the contents of each variable by printing it:
>>> print('A is equal to:', A)
A is equal to: 100
>>> print('B is equal to:', B)
B is equal to: Banana
>>> print('C is equal to:', C)
C is equal to: 10.5
And we can check the type of each variable as follows:
>>> print('Type of A is:', type(A))
Type of A is: <class 'int'>
>>> print('Type of B is:', type(B))
Type of B is: <class 'str'>
>>> print('Type of C is:', type(C))
Type of C is: <class 'float'>


Note here that both single and double quotes can be used to declare a
string variable. For instance:
>>> x, y = 'Banana', "Orange"
>>> print('x is:', x, 'and y is:', y)
x is: Banana and y is: Orange


There are some rules you should follow when you name a variable or object
in Python. First, only alphanumeric characters and underscores can be used
in a variable name. For instance, you can name a variable A_10. A variable
name should never start with a number. Finally, keep in mind that variable
names in Python are case sensitive, which means that age, Age and AGE, for
example, are three different variables:
>>> age, Age, AGE = 9, 10, 11
>>> print('age is:', age)
age is: 9
>>> print('Age is:', Age)
Age is: 10
>>> print('AGE is:', AGE)
AGE is: 11
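
The naming rules can be checked programmatically. The sketch below is one
possible way to do it, using the string method isidentifier() and the
standard keyword module; the list of candidate names is invented for the
example. It also rejects reserved words such as class, a rule that is not
listed above but that Python enforces as well.

```python
import keyword

# Candidate names; only some of them are valid Python variable names.
candidates = ["A_10", "age", "Age", "10_A", "my-var", "class"]

for name in candidates:
    # isidentifier() checks the character rules (letters, digits, underscore,
    # no leading digit); iskeyword() rejects reserved words such as `class`.
    valid = name.isidentifier() and not keyword.iskeyword(name)
    print(f"{name!r}: {'valid' if valid else 'not valid'}")

# Keep only the names that pass both checks.
valid_names = [
    name for name in candidates
    if name.isidentifier() and not keyword.iskeyword(name)
]
print(valid_names)
```

Note that age and Age both pass the check and remain two distinct names,
which is the case sensitivity described above.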
Now that you know how to declare variables and assign values to them, let’s
get into the details of each object type or data structure type.


Number data structure in Python 

The number object is the most fundamental data structure in any
programming language. It is used to store any numeric quantity. Python
supports the basic numerical values, namely integers and floating-point
numbers. It provides the functions to process them, and high-performance
libraries are available to perform more advanced numerical calculations. In
this section, the basic Python functions are presented in detail. The
advanced numerical library is detailed in chapter 5 of this book. In
addition, Python also supports complex numbers and integers of unlimited
precision.


In the following table, we present the basic types of number that are
supported in Python:


Type                                Examples

Integer with unlimited precision    222222222221

Floating point                      1.5, 1.5e-10, 5E100, 1.0e+100

Complex number                      1+2j, 4.0+2.0j, 2J





An integer is written as a string of decimal digits, while a
floating-point number includes a decimal point and, optionally, an exponent
introduced by E or e. If a number is written with a decimal point or an
exponent, Python treats it as a float and uses floating-point math when it
performs operations on it. Now that you know the types of number objects in
Python, let’s see what operations and mathematical functions can be used.
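
A quick way to see these number types in action is to ask Python for the
type of a few literals. The values below are arbitrary examples chosen for
this sketch.

```python
# Integers have unlimited precision: this value does not overflow.
big = 2 ** 100
print(type(big), big)

# A decimal point or an exponent makes the literal a float.
print(type(1.5), type(5E100), type(1e3))

# Complex numbers use j (or J) for the imaginary part.
z = 1 + 2j
print(type(z), z.real, z.imag)
```

Note that 1e3 is a float even though its value, one thousand, has no
fractional part: the exponent alone is enough to make it a float.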

We can apply the logical or and the logical and to number objects. For
instance, if we have two number objects X and Y, we can write (X and Y) and
(X or Y). In the logical or (i.e. X or Y), Y is not evaluated unless X is
false. In the logical and (i.e. X and Y), Y is evaluated only if X is true.
We can use the logical negation by writing not X. We can use the comparison
operators: 1) ‘<’ (strictly less than), 2) ‘<=’ (less than or equal to), 3)
‘>’ (strictly greater than), 4) ‘>=’ (greater than or equal to), 5) ‘==’
(equal to), 6) ‘!=’ (different from), 7) ‘is’ (object identity), 8) ‘is
not’ (negated object identity), 9) ‘in’ (membership in another object), 10)
‘not in’ (non-membership in another object). We can also use the bitwise
operators: 1) ‘|’ (bitwise or), 2) ‘^’ (bitwise exclusive or), 3) ‘&’
(bitwise and).

You can also apply the basic mathematical operations: 1) X + Y (addition),
2) X - Y (subtraction), 3) X * Y (multiplication), 4) X / Y (division), 5)
X // Y (floor division, which drops the fractional part), 6) X % Y
(remainder), 7) +X (identity, equal to X), 8) -X (negation), 9) X ** Y
(power).
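
The operators listed above can be tried on two small integers. The values 5
and 6 are arbitrary; the comments describe what each line is meant to
illustrate.

```python
X, Y = 5, 6

# Logical operators short-circuit: the second operand is evaluated only
# when it is needed, and the result is one of the operands.
print(X or Y)    # X is true, so Y is never evaluated
print(X and Y)   # X is true, so the result depends on Y
print(0 and Y)   # 0 is false, so Y is never evaluated
print(not X)

# Comparison and membership operators return True or False.
print(X < Y, X == Y, X != Y, X in [1, 5, 9])

# Bitwise operators work on the binary representation (5 is 101, 6 is 110).
print(X | Y, X ^ Y, X & Y)

# Arithmetic: true division, floor division, remainder and power.
print(X / 2, X // 2, X % 2, X ** 2)
```

A point that often surprises beginners is that X or Y and X and Y return
one of their operands rather than True or False.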


When you use multiple operations in a single expression such as X+Y*Z,
Python follows the usual mathematical rules of precedence: it performs
multiplication before addition. In the expression above, Python computes
Y*Z and then adds the result to X. Therefore, remember to add parentheses
when you need to specify the order in which the operations are to be
performed. In other words, the statements 1) X + (Y * Z) and 2) (X + Y) * Z
are different, and they generally yield different results. When you add
parentheses, Python is forced to evaluate what is inside the parentheses
first.
Now let’s see some examples. You can open an operating system prompt and
run python, as we learnt in the previous chapter, to try the examples:
C:\Users\***>python 
Python 3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit 
(AMD64)] :: Anaconda, Inc. on win32 
Type "help", "copyright", "credits" or "license" for more 
information. 
>>> 
Let’s create two variables:
>>> X = 5
>>> Y = 6
We can evaluate several expressions in a single statement by separating
them with commas; the result is a tuple displayed between parentheses:
>>> X + 1, X - 1
(6, 4)
Here we added 1 to the variable X and subtracted 1 from it.
>>> X * 4, X / 4
(20, 1.25)
We multiplied and divided the variable X by 4.
>>> X % 4, X ** 4
(1, 625)
We computed X modulo 4 and X to the power of 4.
>>> X + 7.0, 3.0 ** X
(12.0, 243.0)
In the above example we mixed number types in a single statement. Note
that Python returns a float when an operation is performed on an integer
and a floating-point number.
If you use a variable name in an expression before it has been defined,
Python raises an exception (an error), as shown below:
>>> X + PP
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'PP' is not defined


Now let’s explore how Python reacts when multiple operations are passed in
a single statement, and the importance of placing parentheses in the right
place. We will be working with the same variables X and Y we created
before:
>>> X + 4 * 2 - Y
7
>>> X + (4 * 2) - Y
7
>>> (X + 4) * (2 - Y)
-36
>>> X + 4 * (2 - Y)
-11
We can see that the output of the first statement, X + 4 * 2 - Y, and of
the second statement, X + (4 * 2) - Y, is the same. That is because Python
always evaluates the multiplication before the addition and subtraction. In
the third statement, (X + 4) * (2 - Y), Python evaluates the expressions
between parentheses and then multiplies the results. In the fourth
statement, X + 4 * (2 - Y), Python starts by evaluating the expression
between the parentheses, then the multiplication and finally the addition.


Python has some mathematical functions and constants that are available in
the math module. First, the math module should be imported into our working
environment (we will explain modules in detail later in this chapter).
>>> import math
This module provides the number pi:
>>> math.pi
3.141592653589793
It also has the trigonometric functions:
>>> math.cos(math.pi)
-1.0
>>> math.sin(1)
0.8414709848078965
>>> math.tan(1)
1.5574077246549023
Python also has a built-in pow() function that computes the power of a
number, like the ** operator (the math module offers its own math.pow(),
which always returns a float):
>>> pow(2, 2), 2 ** 2, pow(2, 5)
(4, 4, 32)
We can compute the absolute value of a number with the built-in abs()
function:
>>> abs(-60)
60
We can get the integer part of a float with the function int():
>>> int(89.4)
89
You can also round a float to the nearest integer with the function round():
>>> round(3.4), round(3.9)
(3, 4)
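
A few details of these functions are easy to miss, so here is a short
sketch that compares them side by side. The numbers are arbitrary examples;
math.floor() and math.ceil() are additional functions of the math module
that are not covered in the text above.

```python
import math

# Built-in pow() keeps integers; math.pow() always returns a float.
print(pow(2, 5), math.pow(2, 5))

# int() drops the fractional part (truncates toward zero), while
# math.floor() rounds down and math.ceil() rounds up.
print(int(-89.4), math.floor(-89.4), math.ceil(-89.4))

# round() rounds to the nearest integer; exact halves go to the even one.
# A second argument gives the number of decimals to keep.
print(round(2.5), round(3.5), round(3.14159, 2))
```

The difference between int() and math.floor() only shows up for negative
numbers, which is why a negative value is used here.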


String data structure in Python 
Strings are the second most basic data structure in any programming
language. In Python there is only one string type to store a sequence of
characters, and there is no separate object to store a single character,
unlike C for instance, which has the type char. The string object is
immutable, which means that once it is created its content cannot be
changed in place. Characters are ordered from left to right. Basically, a
string is an array of characters. A string can be defined using single
quotes, double quotes, or triple quotes for a block of text spanning
several lines. Let’s see some examples of defining strings:

>>> A = 'My simple quote string'
>>> print(A)
My simple quote string
>>> B = "My double quote string"
>>> print(B)
My double quote string
>>> C = """My triple quotes string
... extends on two lines"""
>>> print(C)
My triple quotes string
extends on two lines


Note that using single or double quotes to define a string is equivalent and
yields the same object:
>>> 'My single quote', "My single quote"
('My single quote', 'My single quote')


Python, unlike lower-level languages such as C, has built-in functions
and methods to manipulate and perform operations on string objects.
The operator ‘+’ can be used to concatenate two strings. For example:
>>> A = 'My first string'
>>> B = ' and my second string'
>>> C = A + B
Now let’s display the result:
>>> print('My concatenated string is:', C)
My concatenated string is: My first string and my second string


We can repeat a string several times using the operator ‘*’ as follows:
>>> C = A * 2
>>> print('My first string 2 times is:', C)
My first string 2 times is: My first stringMy first string


We can get the length of the string using the function len():
>>> L = len(A)
>>> print('The string A has a length of:', L)
The string A has a length of: 15


Remember that a string is a sequence of characters, so we can get a single
character or a range of characters in a string by indexing and slicing:
>>> x = A[0]
>>> print('The first character in A is:', x)
The first character in A is: M
>>> X = A[0:2]
>>> print('The first 2 characters in A are:', X)
The first 2 characters in A are: My
We can search for characters in a string using the method find(). For
example, let’s search for ‘str’ in the string A we created before:
>>> A.find('str')
9
The find() method returns the position of the first character of the
sequence we are looking for (or -1 if the sequence is not found):
>>> A[9]
's'
We can replace a sequence of characters in a string using the method
replace(). To apply it, let’s replace the sequence ‘first’ by ‘third’ in
the string A:
>>> C = A.replace('first', 'third')
>>> C
'My third string'


We can also split a string using the method split(). By default, this
method splits the string wherever it finds whitespace between sequences of
characters. If we split, for example, the string A, we get the following
output:
>>> C = A.split()
>>> C
['My', 'first', 'string']
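Because strings are immutable, the string methods shown above never change the original string; they return a new one. The short sketch below illustrates this using the same example string; the variable names are chosen only for the illustration.

```python
A = 'My first string'

# Strings are immutable: assigning to an index raises a TypeError.
try:
    A[0] = 'm'
    changed_in_place = True
except TypeError:
    changed_in_place = False

# Methods such as replace() return a new string and leave A untouched.
B = A.replace('first', 'third')

# split() and join() are inverse operations on whitespace-separated words.
words = A.split()
rejoined = '-'.join(words)

print(changed_in_place, A, B, rejoined)
```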


In the next sections, we are going to cover lists and dictionaries, two data
structures that are collections of other data structures. These two data
structures are the main objects used in the majority of Python scripts.
Lists and dictionaries are very flexible: they can be changed, they can grow
and shrink on demand, and they can contain any other type of data
structure. We will start with lists in the next section.


List data structure in Python 

Lists are a flexible data structure that can contain items of different
types. The items are stored in a list in order, from left to right. As in
strings, items in lists can be accessed by indexing, and because items are
ordered by position, you can also perform concatenation and slicing. Unlike
strings, the length of a list can change on demand, and a list may contain
several types of data, not just characters or a single data type: lists are
heterogeneous. In addition, a list can contain other lists. Lists are
mutable, which means that you can modify them after they are defined. So,
all operations that can be performed on strings (i.e. slicing, indexing,
concatenation) as well as operations such as index assignment and deletion
can be applied to lists. Now that you understand the principle of lists,
let’s see examples of operations that can be performed on lists. Keep in
mind that items in lists are written in Python between square brackets.


An empty list can be created as follows:
>>> list1 = []
>>> print('The empty list is:', list1)
The empty list is: []


We can create a list of values by writing the items between brackets as
follows:
>>> my_first_list = [1, 2, 3, 4]
>>> print('This is my first list in Python:', my_first_list)
This is my first list in Python: [1, 2, 3, 4]


The first element of the list can be accessed by its position. Remember that
in Python indexing starts at 0.
>>> x = my_first_list[0]
>>> print('The first item in my first list in Python is:', x)
The first item in my first list in Python is: 1


We can apply slicing on lists to get a range of items from a list: 
>>> x = my_first_list[0:2] 
>>> print(' The first 2 items in my first list in Python are:', x) 
The first 2 items in my first list in Python are: [1, 2] 
Note that in Python, the item that corresponds to the last index in slicing is 
not returned. In other words, if we pass in list L indices i:j as follows L[i:j], 
Python will return items from i to j-1. 
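The half-open slicing rule described above (start included, end excluded) can be sketched with a small list. The list and variable names below are illustrative only.

```python
L = [10, 20, 30, 40, 50]

head = L[0:2]      # items at index 0 and 1; index 2 is excluded
middle = L[1:4]    # items at index 1, 2 and 3
tail = L[3:]       # an omitted end index means "up to the end"
last = L[-1]       # negative indices count from the end of the list

# A slice L[i:j] contains j - i items when both indices are within range.
slice_length = len(L[1:4])

print(head, middle, tail, last, slice_length)
```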
We can modify elements in a list using indexing or slicing. For example:
>>> my_first_list[1:3] = [5, 5]
>>> print('My new list after slice assignment is:', my_first_list)
My new list after slice assignment is: [1, 5, 5, 4]


The operator ‘+’ can be applied to concatenate two lists. For example:
>>> my_first_L = [1, 2, 3, 4]
>>> my_second_L = [5, 6, 7, 8, 9]
>>> Concat_L = my_first_L + my_second_L
>>> print('The concatenated list is:', Concat_L)
The concatenated list is: [1, 2, 3, 4, 5, 6, 7, 8, 9]
A list can be repeated several times using the operator ‘*’. For example,
let’s repeat the first list we created 3 times:
>>> A = my_first_L * 3
>>> print('My first list repeated 3 times is:', A)
My first list repeated 3 times is: [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4]


Python offers several functions and methods for inspecting and manipulating
list objects, such as computing the length, sorting, reversing and others
that we are going to cover in the following examples.

To compute the length of a list, the len() function can be used as follows:
>>> L = len(my_first_L)
>>> print('The length of my first list in Python is:', L)
The length of my first list in Python is: 4
Items in a list can be sorted using the method sort(). For example:
>>> my_L = [9, 3, 4, 1, 0]
>>> my_L.sort()
>>> print('The sorted list is given as:', my_L)
The sorted list is given as: [0, 1, 3, 4, 9]
Items in a list can be reversed using the method reverse(). For example,
let’s reverse the items of our previous list:
>>> my_L.reverse()
>>> print('The reversed list is as follows:', my_L)
The reversed list is as follows: [9, 4, 3, 1, 0]
To add an item to a list, the append() method can be used as in the
example below:
>>> my_L.append(10)
>>> print('My list with appended item is:', my_L)
My list with appended item is: [9, 4, 3, 1, 0, 10]
The append() method is different from concatenation using the operator ‘+’.
Concatenation expects two lists as input and builds a new list, while
append() takes a single value and adds it to the end of the existing list.
Another way to add items to a list is the method extend(), which adds every
item of another list:
>>> Original_L = [0, 1, 3, 4, 9]
>>> Original_L.extend([10, 11, 12])
>>> print('My list with added items with extend is:', Original_L)
My list with added items with extend is: [0, 1, 3, 4, 9, 10, 11, 12]
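The difference between append(), extend() and the ‘+’ operator is easiest to see when the value being added is itself a list. The sketch below uses illustrative variable names and values.

```python
base = [0, 1, 3]

with_append = list(base)        # copy, so that base itself is not modified
with_append.append([4, 9])      # adds ONE item, which is itself a list

with_extend = list(base)
with_extend.extend([4, 9])      # adds each item of the given list

with_plus = base + [4, 9]       # builds a new list; base is left unchanged

print(with_append, with_extend, with_plus, base)
```

In short, append() grows the list by exactly one item, while extend() and ‘+’ grow it by as many items as the second list contains.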
The last item of a list can be deleted using the method pop() as follows:
>>> Original_L.pop()
12
>>> print('My list with last item deleted is:', Original_L)
My list with last item deleted is: [0, 1, 3, 4, 9, 10, 11]

Note that the pop() method returns the item that was deleted. Items can
also be deleted from a list according to their position using the del
statement. For example, let’s delete the last two items from the list
Original_L:
>>> del Original_L[5:]
>>> print('My list with the last 2 items deleted with del is:', Original_L)
My list with the last 2 items deleted with del is: [0, 1, 3, 4, 9]


Dictionary data structure in Python 
Dictionaries are similar to lists in that they can store multiple objects of
different types. However, dictionaries differ from lists in the way items
are stored and retrieved. In dictionaries, items are fetched using a key,
unlike lists, which use a position. Because dictionaries are a built-in data
structure, they come with different methods and operators that make them
easy to manipulate.
Dictionaries assign to each item a key that is used to access this item. The
same form of indexing as in lists is used to get an item, but a key is used
instead of a position. Internally, Python stores the items of a dictionary
in a hash table, which allows rapid lookup by key. Since Python 3.7 a
dictionary remembers the order in which items were inserted, but items
still cannot be accessed by position. Keys give a symbolic location of the
elements of a dictionary rather than a physical position. The length of a
dictionary can change (i.e. increase or decrease) after it is defined, and
items in a dictionary can be changed. However, dictionaries do not support
slicing and the other position-based operations applied to lists and
strings. Dictionaries are written between braces, where each key (for
example a string between quotes) is followed by a colon and its item. So, if
we want to define an empty dictionary, we just type:

>>> my_dictionary = {}
>>> print('My empty dictionary is:', my_dictionary)
My empty dictionary is: {}

A dictionary can be defined as follows:
>>> my_dict = {'name': 'John', 'age': 10}
>>> print('My first dictionary of 2 items is:', my_dict)
My first dictionary of 2 items is: {'name': 'John', 'age': 10}


Here we defined a dictionary of two items that has name and age as keys.
The two items are single values, where the first one is a string and the
second is an integer. We can also define a dictionary with two items where
the items are lists. Let’s take our previous example and create a
dictionary from a list of names and a list of ages:
>>> name_list = ['John', 'Brian', 'Mark', 'Alex']
>>> Age = [10, 20, 30, 40]
>>> my_dict = {'name': name_list, 'age': Age}
>>> print('My first dictionary of 2 items from lists is:', my_dict)
My first dictionary of 2 items from lists is: {'name': ['John', 'Brian',
'Mark', 'Alex'], 'age': [10, 20, 30, 40]}


An item of a dictionary can be accessed by its key. For the dictionary we
previously defined, we can access the first item, whose key is name, as
follows:
>>> A = my_dict['name']
>>> print('The first item of my dictionary is:', A)
The first item of my dictionary is: ['John', 'Brian', 'Mark', 'Alex']

In order to get the keys of a dictionary, the method keys() can be used:
>>> my_dict.keys()
dict_keys(['name', 'age'])


To search for a key in a dictionary we can use the following expression:
key in dictionary. So, if we want to check whether keys are part of the
dictionary we defined before, we can do:
>>> F = 'name' in my_dict
>>> print('Is the key name in my dictionary?', F)
Is the key name in my dictionary? True
>>> F = 'address' in my_dict
>>> print('Is the key address in my dictionary?', F)
Is the key address in my dictionary? False


As with lists, to check the size of a dictionary the function len() can be
used:
>>> L = len(my_dict)
>>> print('The length of my dictionary is:', L)
The length of my dictionary is: 2

In order to get the values of the items stored in a dictionary, the method
values() is available:
>>> my_dict.values()
dict_values([['John', 'Brian', 'Mark', 'Alex'], [10, 20, 30, 40]])


As with lists, items in a dictionary can be deleted using the del
statement. In a dictionary, items are deleted using their key. For example,
let’s delete the item age from the dictionary we created in this section:
>>> del my_dict['age']
>>> print('My dictionary with item age deleted with del is:', my_dict)
My dictionary with item age deleted with del is: {'name': ['John', 'Brian',
'Mark', 'Alex']}
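Two other dictionary operations are worth knowing alongside the ones above: the method get(), which avoids an error when a key is missing, and the method items(), which lets a loop visit each key together with its item. The sketch below is illustrative; the key 'city' and its value are hypothetical additions that do not appear in the book’s example.

```python
my_dict = {'name': 'John', 'age': 10}

my_dict['city'] = 'Paris'                       # hypothetical key, added for illustration
missing = my_dict.get('address')                # returns None instead of raising KeyError
fallback = my_dict.get('address', 'unknown')    # or a default value of our choice

pairs = []
for key, value in my_dict.items():              # visit each (key, item) pair
    pairs.append((key, value))

print(missing, fallback, pairs)
```

On Python 3.7 and later, the pairs come out in the order in which the keys were inserted.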


Now that you know the basic data structures in Python, let’s explore how we
can develop a script.


Statements, Modules and functions in Python 
In this subsection, we present the statements that you can use in Python
and how you can handle the data structures we presented previously in this
chapter. Statements are the instructions that are given in order to perform
operations on data structures. A set of statements forms a function or a
module that performs one or several tasks. A collection of modules and
functions forms a program that performs a task or simulates a system.


The basic statement that can be performed in any programming language is
assignment, as we did before when we presented Python data structures.
Assignment consists of giving a value to a variable. Another basic
statement is printing data structures or variables with the function print.
For example:
>>> A = 5
>>> print('This is an example of assignment and printing a variable:', A)
This is an example of assignment and printing a variable: 5


Other common and very useful statements are the if/else statement and the
iteration statements for/else and while/else. The if/else statement is
defined in Python as follows:

if <condition1>:
    <statement1> # block of tasks to run when condition1 is satisfied
elif <condition2>: # optional
    <statement2> # block of tasks to run when condition2 is satisfied
else:
    <statement3> # block of tasks to run otherwise
For example, let’s check whether the variable A we created is equal to 0 or
not:
>>> if A == 0:
...     print('A = 0')
... else:
...     print('A != 0')
...
A != 0


Note that because we are running in an interactive command-line session,
Python uses ‘...’ as the prompt for continuation lines. Note also that
indentation is very important when programming in Python: it is used to
indicate that a set of statements forms a block of code. If indentation is
skipped, Python throws an error. For example, let’s run the same code as
before without indentation:
>>> if A == 0:
... print('A = 0')
  File "<stdin>", line 2
    print('A = 0')
    ^
IndentationError: expected an indented block
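The full if/elif/else form described above can be sketched as a small function. The function name describe and the labels it returns are hypothetical, chosen only for this illustration.

```python
def describe(value):
    """Classify a number as 'zero', 'positive' or 'negative'."""
    if value == 0:          # first condition tested
        return 'zero'
    elif value > 0:         # tested only when the first condition is False
        return 'positive'
    else:                   # runs when no condition above was satisfied
        return 'negative'

for number in (0, 5, -3):
    print(number, 'is', describe(number))
```

Each indented block belongs to the line ending with a colon just above it, which is why consistent indentation matters.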


Looping statements are also very common in programming languages. Looping
statements allow repeating an action several times. There are two main
looping statements: the while loop and the for loop. The while loop serves
as a general loop when the number of iterations is not known beforehand.
This loop repeats an action over and over as long as a condition is
satisfied. The second loop, the for loop, is designed specifically to
iterate through a sequence of items and perform an action for each item.
When using the for loop, the number of iterations is known beforehand. The
general syntax of the for loop is given below:

for <item> in <object or data structure>: # assign each item from the object to item
    <statement> # block of actions to perform on the item
As an example, let’s write a for loop that computes the square of each
number in a list:
>>> my_L = [5, 6, 2, 3, 4, 5]
>>> print('my_L is:', my_L)
my_L is: [5, 6, 2, 3, 4, 5]
>>> L = []
>>> for item in my_L:
...     P = item * item
...     L.append(P)
...
>>> print('The result is:', L)
The result is: [25, 36, 4, 9, 16, 25]
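The same squares can also be computed with a list comprehension, a compact Python form of the for loop that builds a list. The sketch below compares both forms and also shows enumerate(), which yields the position of each item along with the item; the variable names are illustrative.

```python
my_L = [5, 6, 2, 3, 4, 5]

# Explicit loop: append each square to an initially empty list.
squares_loop = []
for item in my_L:
    squares_loop.append(item * item)

# List comprehension: the same computation in a single expression.
squares_comp = [item ** 2 for item in my_L]

# enumerate() pairs each item with its index, starting at 0.
positions = [(index, item) for index, item in enumerate(my_L)]

print(squares_loop, squares_comp, positions[:2])
```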


The syntax of the while loop is as follows:
while <condition>: # condition that keeps the loop running
    <statements> # block of code to run
else:
    <statement> # code to run when the loop exits without a break

Unlike the for loop, which runs once per item of a sequence, the while loop
keeps running only as long as its condition is satisfied.
For example, let’s print the value of a variable as long as its value is
less than 10:
>>> index = 1
>>> while index < 10:
...     print(index)
...     if index == 10:  # never true here: the loop condition stops first
...         break
...     index += 1
...
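The interplay between break and the else block of a while loop can be sketched as follows. The function first_multiple and its arguments are hypothetical, written only to illustrate the rule that the else block is skipped when the loop ends through break.

```python
def first_multiple(step, limit):
    """Return the first multiple of `step` in 1..limit-1, or None if there is none."""
    index = 1
    while index < limit:
        if index % step == 0:
            break               # leaves the loop and skips the else block
        index += 1
    else:
        return None             # runs only when the condition became False
    return index

print(first_multiple(7, 10))    # a multiple exists below 10
print(first_multiple(11, 10))   # no multiple of 11 below 10
```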
Functions are blocks of code that can be reused with different input values.
Indeed, functions are very handy when you have repeated tasks to perform
within a program with different types of values or data structures. The
general syntax to define a function is as follows:
def function_name(input_arguments):
    <statements>
For example, a function that takes 2 variables as input and returns their
sum can be defined as:
>>> def sum_fct(x, y):
...     c = x + y
...     return c
...
To call the function, all you have to do is type the name of the function
with its input arguments. For the function we developed in the previous
example:
>>> p = sum_fct(3, 4)
>>> print('The sum is:', p)
The sum is: 7
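Functions can also declare default values for their arguments and be called with keyword arguments. The sketch below extends the sum function with a default value for y; this default is an assumption made for the illustration and is not part of the example above.

```python
def sum_fct(x, y=0):
    """Return x + y; y defaults to 0 (an assumption made for this sketch)."""
    return x + y

p1 = sum_fct(3, 4)          # positional arguments
p2 = sum_fct(3)             # y takes its default value
p3 = sum_fct(y=10, x=1)     # keyword arguments can be given in any order
p4 = sum_fct('ab', 'cd')    # works for any types that support the + operator

print(p1, p2, p3, p4)
```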


You can develop a collection of functions that you need and save those
functions in a file with the ‘.py’ extension. This file is called a module.
A module is loaded with the import statement. The syntax is as follows:

import module
To call a function from the module, the following form should be used:

module.function_name()
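The import statement has a few common forms. The sketch below shows them with the standard math module, used here simply because it ships with Python; the same forms apply to a module you write yourself.

```python
import math                 # import the whole module
import math as m            # the same module under a shorter alias
from math import sqrt       # import a single name directly

a = math.sqrt(16)           # module.function_name(...)
b = m.sqrt(16)              # alias.function_name(...)
c = sqrt(16)                # the name can be used without a prefix

print(a, b, c)
```

The prefixed forms make it clear where a function comes from, which helps when several modules define functions with the same name.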


Now that you have a general idea of modules in Python, let’s explore some
crucial libraries used for data analysis in the next section.


3.2 Introduction to Python Libraries 


Python libraries are collections of modules that offer very useful functions
and classes of data structures. The most useful libraries in Python for data
analysis are NumPy, Pandas and Matplotlib. This section provides an overview
of these libraries, which are discussed in depth in separate chapters.
NumPy stands for Numerical Python and offers a powerful tool for scientific
and numerical computation in Python. This library introduces an array data
structure which is similar to lists, with the exception that arrays are
formed by items of the same type. This data structure is very efficient for
handling data in numerical computing. The Pandas library is another useful
library that is designed specifically for data analysis. It offers other
types of data structures that make data analysis more efficient compared to
the basic tools of Python. It also provides functions to handle data and
detect missing values. Matplotlib is a very handy library for plotting
data, creating figures and data visualization in general. It helps to
detect correlation among variables and to create histograms and
distributions of data, to name a few features of this library. Pandas and
Matplotlib are built on top of NumPy, so the three libraries are commonly
imported into the workspace together. These libraries do not come by
default with a standard Python installation and need to be installed first.
To install a library, the commands below can be run in a terminal (not at
the Python >>> prompt):

pip install numpy
pip install pandas
pip install matplotlib
To import these libraries, you use the following commands:

>>> import numpy as np
>>> import pandas as pd
>>> import matplotlib.pyplot as plt


Chapter 4: IPython intensive course


In this chapter, we present how to use IPython for interactive
programming.


4.1 Introduction and installing IPython

IPython is an interactive programming environment for Python. Its notebook
interface, now called Jupyter, is a web interface that allows interactive
programming in different languages including Python. IPython can be
installed easily with Anaconda, which is freely available at the official
website www.anaconda.com/download. This package is available for all major
operating systems. IPython can also be installed with the command
pip install ipython. IPython can be launched from Anaconda Navigator or
from Anaconda Prompt, both of which can be opened from the search bar. In
Anaconda Prompt, you start IPython by typing: ipython. You will see that
the input and output are in different colors, and that each input is
numbered. The figure below shows what the prompt looks like when IPython is
launched from Anaconda Prompt.


Example of IPython launched from Anaconda Prompt.


To launch IPython from Anaconda Navigator, you launch Jupyter Notebook,
which opens a window in your internet browser. In the notebook options, you
select Python 3, which opens a new notebook in a new browser tab. You
should see something similar to the screenshot below:
Example of Jupyter notebook.
Whether you run IPython in an internet browser or in a prompt, it runs the
same way and has a similar structure, as we have seen. The code is typed in
cells. As you might have noticed, the prompt of IPython is different from
the prompt of Python. In a command line, Python runs with the >>> prompt,
whereas IPython runs with a prompt that has the following structure:

In [1]:

Out[1]:
The Jupyter notebook is a very useful tool to share code with other users.
In fact, Jupyter is an easy way to combine both markdown and executable
Python source code in one single canvas, named a notebook, as we can see
from the figure above.


4.2. Jupyter notebook 

In Jupyter the code is typed in cells, as indicated in the figure above that
shows an example of a Jupyter notebook. These cells are marked as In [ ].
When you type code in a cell, you can run it by clicking on the Run button
or by using the keyboard shortcut SHIFT+ENTER. When you execute a cell,
Jupyter adds another cell automatically as follows:


In [1]: import pandas as pd

In [ ]:


In the figure above, we imported the library Pandas. After executing that
cell, Jupyter automatically numbered it and added a new cell. When we
execute a command that produces an output, Jupyter displays the output right
after the input cell. Let’s print, for example, a statement and see how
Jupyter displays the output.
In [1]: import pandas as pd

In [2]: print('Testing Jupyter Notebook')
Testing Jupyter Notebook
In a Jupyter notebook, we can type a block of code in a single cell, which
is more convenient than the basic interactive command line of Python.
In [1]: import pandas as pd

In [2]: print('Testing Jupyter Notebook')
Testing Jupyter Notebook

In [3]: L = [1, 2, 3, 4, 5]
        for i in L:
            print('my number in list is:', i)
my number in list is: 1
my number in list is: 2
my number in list is: 3
my number in list is: 4
my number in list is: 5
In this example, in a single cell we created a list and displayed the items
of the list. Programming in a Jupyter notebook follows the same rules as in
Python. Indentation also applies in Jupyter, and if it is not used, Jupyter
will throw an error like in the following example:
In [5]: L = [1, 2, 3, 4, 5]
        for i in L:
        print('My number in list is:', i)

  File "<ipython-input-5-217f0fddabab>", line 3
    print('My number in list is:', i)
    ^
IndentationError: expected an indented block


Note also that in Jupyter, the Python functions are shown in a different
color (i.e. green) and everything that is written between quotes is shown
in another color (i.e. red).


4.3. Jupyter functionalities 


Jupyter notebook provides useful functionalities that allow assessing the
performance of Python code. These functionalities are called magic
commands. The magic commands allow us to evaluate the time and resources
that the code requires, which is very handy when optimizing scripts. The
magic commands covered here are %time, %timeit, %%heat, %memit, %mprun,
%prun, %lprun and %snakeviz. Overall, these commands provide information on
how much CPU time and memory the code uses. These elements are very
important when developing optimization, simulation and numerical codes in
engineering fields.
Let’s consider, for example, that we want to estimate the number π using
Monte Carlo simulations. Monte Carlo simulation is a numerical method that
estimates a value from the mean of random samples. Remember that π/4 is
exactly the area of a quarter of a circle of radius 1. If we draw a square
of side 1 over this quarter circle, the ratio between the area of the
quarter circle and the area of the square is π/4. So, if we draw random
points uniformly in the square, the fraction of points that fall inside the
quarter circle approximates π/4, and multiplying this fraction by 4
approximates π. When using Monte Carlo simulation, we need to specify the
number of simulations, or more precisely the number of random points we
draw to estimate the target value. Monte Carlo simulation is known to be
computationally intensive and time consuming.


In order to apply Monte Carlo simulation, we develop the following function
that estimates π. To do so, we need to import random in order to generate
random values.

from random import random

def Approximate_pi(n=1e5) -> "Area":
    # Give an approximation of pi using Monte Carlo simulations
    # Input argument: number of simulations
    in_C = 0
    Tot = n
    while n != 0:
        A = random()
        B = random()
        if pow(A, 2) + pow(B, 2) <= 1:
            in_C += 1  # inside the circle
        n -= 1
    return 4 * in_C / Tot


In order to evaluate the runtime of the function, we call the function using
%time as follows:

%time Approximate_pi()

When running the function with %time as in the command above, we get the
following output:

Wall time: 50.8 ms

3.14772
So, the runtime of the function that estimates the number π is 50.8 ms. The
magic command %time is very useful to compare the runtime of different
functions.


We can get a more reliable estimate than %time by using the magic command
%timeit, with the flag -r for the number of runs and the flag -n for the
number of loops per run, as follows:

%timeit -r 2 -n 5 Approximate_pi()

By using the magic command %timeit, we get the output provided below:

61.6 ms ± 2.45 ms per loop (mean ± std. dev. of 2 runs, 5 loops each)

So, the function that we created to approximate π takes on average 61.6
milliseconds per call.
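Outside of Jupyter, a comparable measurement can be made with the standard timeit module. The sketch below is an illustration under stated assumptions: approximate_pi is a compact re-implementation of the function above written for this example, and the call timeit.repeat(..., repeat=2, number=5) mirrors the flags -r 2 -n 5.

```python
import timeit
from random import random

def approximate_pi(n=100_000):
    """Monte Carlo estimate of pi (re-implementation for this sketch)."""
    inside = 0
    for _ in range(n):
        a, b = random(), random()
        if a * a + b * b <= 1:      # the point falls inside the quarter circle
            inside += 1
    return 4 * inside / n

# 2 runs of 5 loops each; every entry is the total time of one run in seconds.
runs = timeit.repeat(lambda: approximate_pi(10_000), repeat=2, number=5)
per_loop = [total / 5 for total in runs]

print('seconds per loop:', per_loop)
print('estimate of pi:', approximate_pi())
```

The timings depend on the machine, so only their order of magnitude is comparable across computers.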


We can actually break down the runtime according to the functions and
actions taken in a function. This is very handy to understand what makes
code run slowly. To do so, we can use the magic command %prun as follows:

%prun Approximate_pi()

The output of this magic command is as follows:

   400004 function calls in 0.110 seconds
   Ordered by: internal time
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.063    0.063    0.110    0.110 <ipython-input-17-ffad3c3608c0>:2(Approximate_pi)
   200000    0.035    0.000    0.035    0.000 {built-in method builtins.pow}
   200000    0.011    0.000    0.011    0.000 {method 'random' of '_random.Random' objects}
        1    0.000    0.000    0.110    0.110 {built-in method builtins.exec}
        1    0.000    0.000    0.110    0.110 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}


As you can see from the output of %prun, we get the runtime for each
function called in the procedure we created. For instance, the whole call
to Approximate_pi() takes 0.110 seconds, of which computing the power
function takes 0.035 seconds. The output also provides the number of times
each function was called (ncalls), the total time spent in the function
itself, excluding the functions it calls (tottime), the time per call
(percall) and the cumulative time, which includes the time spent in all the
functions it calls (cumtime). When running the magic command as we did, it
displays the output in the notebook. If we wish to save the output for
later use, we can pass the flag -D followed by a file name (the name
pi_profile.prof below is only an example):

%prun -D pi_profile.prof Approximate_pi()


We may also sort the output using the flag -s, for example by cumulative
time, as follows:

%prun -s cumulative Approximate_pi()

Now, the output is:

   400004 function calls in 0.097 seconds
   Ordered by: cumulative time
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.097    0.097 {built-in method builtins.exec}
        1    0.000    0.000    0.097    0.097 <string>:1(<module>)
        1    0.053    0.053    0.097    0.097 <ipython-input-17-ffad3c3608c0>:2(Approximate_pi)
   200000    0.034    0.000    0.034    0.000 {built-in method builtins.pow}
   200000    0.010    0.000    0.010    0.000 {method 'random' of '_random.Random' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

You can notice from the output above that the result is sorted according to
the cumulative time in descending order.
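The same kind of profile can be obtained in a plain Python script with the standard modules cProfile and pstats, on which %prun is based. The sketch below is illustrative: approximate_pi is a shortened re-implementation written for this example, and the report is captured in a string so that it can be inspected or saved.

```python
import cProfile
import io
import pstats
from random import random

def approximate_pi(n=10_000):
    """Monte Carlo estimate of pi (re-implementation for this sketch)."""
    inside = 0
    for _ in range(n):
        if pow(random(), 2) + pow(random(), 2) <= 1:
            inside += 1
    return 4 * inside / n

# Collect profiling data only while the function runs.
profiler = cProfile.Profile()
profiler.enable()
approximate_pi()
profiler.disable()

# Sort by cumulative time, like `%prun -s cumulative`, and keep the top 5 rows.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats('cumulative').print_stats(5)
report = buffer.getvalue()

print(report)
```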


The magic command %lprun provides the runtime for every line of the code.
This magic command is not installed by default, so you need to install the
line_profiler module and load its extension in order to use it:

!pip install line_profiler
%load_ext line_profiler
Then run the command, passing the function to profile (without parentheses)
to the flag -f:

%lprun -f Approximate_pi Approximate_pi()


To use the heat magic command, you need to install the py-heat-magic
package:
!pip install py-heat-magic
The heat magic command is used as follows:
%load_ext heat
%%heat
from random import random

def Approximate_pi(n=1e5) -> "Area":
    # Give an approximation of pi using Monte Carlo simulations
    # Input argument: number of simulations
    in_C = 0
    Tot = n
    while n != 0:
        A = random()
        B = random()
        if pow(A, 2) + pow(B, 2) <= 1:
            in_C += 1  # inside the circle
        n -= 1
    return 4 * in_C / Tot

Approximate_pi()


Chapter 5: NumPy intensive course 


This chapter focuses on using the NumPy library in Python. We will provide
examples to explain how to use the functionalities of the NumPy library. But
first, what is the NumPy library and what is it used for? We will answer
these questions before getting into the functionalities of this library.


1 Python NumPy Library


NumPy stands for Numerical Python or Numeric Python. NumPy is a free,
open-source package for Python. This library focuses on numerical and
mathematical programming and provides support functions for scientific,
data science and engineering programming. NumPy is a crucial library to
know if you are interested in, or already performing, statistical or
mathematical operations in general. In fact, other libraries are built on
top of NumPy, like Pandas for data analysis or scikit-learn for
implementing machine learning methods. So, it is essential to start by
learning the basics of the NumPy library as well as mastering the
functionalities of this package.


The NumPy library is very efficient in handling multi-dimensional arrays as
well as matrix operations. In fact, NumPy is the basic library to handle
and process numerical data in Python. This library offers the n-dimensional
array, called ndarray for short. The ndarray object is similar to lists in
Python, with the exception that an ndarray stores items of the same type.
This makes it convenient to perform mathematical operations on these
ndarrays. The library is very fast and makes matrix multiplication, and
mathematical operations in general, very simple. You might be thinking: if
ndarrays are similar to lists in Python, then why not simply use Python
lists? The ndarray objects provided by NumPy are more compact and faster to
access, whether for reading or writing items, and are therefore more
efficient and convenient. Moreover, the items of an array are handled as a
vector and support vectorized operations, which is not the case for Python
lists. So, what is an array, what makes NumPy a special and efficient
library, and how is an array defined in Python?


2 Arrays in Python


An array is a data structure that contains several items of the same type.
An array can have any number of dimensions. In fact, an array is an
instance of the ndarray class, which has several attributes and methods
related to it. Items of the ndarray can be accessed by index. Remember that
indexing in Python starts from 0 and not 1. Therefore, the index of the
first element is 0. Every ndarray is associated with a data-type object,
known as dtype, that describes the type of its items. An item that is
extracted from the ndarray object is represented in Python by an array
scalar type. All items of an ndarray are stored in blocks of the same size
in memory. The basic function in NumPy to create an ndarray is the function
array(), which is used as numpy.array().
First, if you did not install the NumPy library in the previous chapters, you 
can do that through Anaconda by typing in a command line: 

conda install -c anaconda numpy 
Normally, NumPy is installed by default in Anaconda. If you are using a 
Jupyter Notebook, which was presented in chapter 4, you can install NumPy by 
typing in a cell: 

pip install numpy 
Remember that to use a library you should import it into your working 
environment by typing: 

import numpy as np 
In the command above, we imported numpy and assigned it the alias np. From 
now on, to access any function of the NumPy library we use np instead of 
numpy. Now that NumPy is installed and imported into your Python working 
environment, we can start learning how to create and use the ndarrays of the 
NumPy library. 


Creating arrays with NumPy 


There are different ways to create an ndarray with the constructor function 
array(). This function typically takes the following inputs: 

np.array(object, dtype, copy, order, subok, ndmin) 
The first input parameter 'object' is the only mandatory input to the 
constructor array(). This input parameter can be any object that represents a 
sequence of items to be stored in the ndarray. The rest of the input 
parameters are optional. The input parameter 'dtype' represents the desired 
type of data of the array. By default, this parameter is set to None, in 
which case NumPy infers the type from the items. The input parameter 'copy' 
is a Boolean parameter which is True by default. So, the object is copied. 
The input parameter 'order' defines the memory layout and can take the values 
'K', 'A', 'C' or 'F'. 'C' stands for row-major order and 'F' stands for 
column-major order, while 'K' and 'A' keep the layout of the input where 
possible. By default, 'K' is assigned to the 'order' input parameter. The 
input parameter 'subok' is also a Boolean parameter, which is set to False by 
default so that the returned array is forced to be a base-class array. In 
case this parameter is set to True, sub-classes of array are passed through. 
The input parameter 'ndmin' defines the minimum number of dimensions of the 
array returned by the constructor array(). Now that you understand what 
inputs the function array() can take, let's get into examples of constructing 
ndarrays. 
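The optional parameters described above can be tried out with a short 
sketch. The variable names are illustrative assumptions; only dtype, ndmin 
and copy are exercised here: 

```python
import numpy as np

values = [1, 2, 3]

as_float = np.array(values, dtype=float)   # force a floating-point type
as_2d = np.array(values, ndmin=2)          # at least 2 dimensions: shape (1, 3)

source = np.array(values)
copied = np.array(source, copy=True)       # independent copy of the data
copied[0] = 99                             # does not affect 'source'

print(as_float, as_float.dtype)
print(as_2d.shape)
print(source, copied)
```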
The most obvious method to define an array is to create it from a list. For 
instance, if we have a list as follows: 

>>> mylist = [1, 2, 3, 4] 
We can create an array as follows: 

>>> myarray = np.array(mylist) 
>>> print(myarray) 
[1 2 3 4] 
Now we can check the type of the object myarray as follows: 
>>> print (' My array is: ', type(myarray)) 
My array is: <class 'numpy.ndarray'> 
Here, we created a NumPy array of dimension 1 from a Python list. We can 
create a multi-dimensional array by passing multiple lists to the function 
array(). For instance: 
>>> mylist1 = [1, 2, 3, 4] 
>>> mylist2 = [5, 6, 7, 8] 
>>> my2darray = np.array([mylist1, mylist2]) 
>>> print(' My 2Dimensional array is: ', my2darray) 
My 2Dimensional array is: [[1 2 3 4] [5 6 7 8]] 
You can also force a certain type of data in the constructor array(). For 
instance, to create an array of complex numbers, we specify the 'dtype' in 
the function array() as in the example below: 
>>> mycomplex_array = np.array(mylist2, dtype='complex') 
We can print mycomplex_array as follows: 
>>> print(' My array of complex data is: ', mycomplex_array) 
My array of complex data is: [5.+0.j 6.+0.j 7.+0.j 8.+0.j] 
We can see from the example above that the function array() created an array 
of complex data, although we passed a list of integers as input, because we 
specified that the type of data is complex. The input argument 'dtype' can 
take, among others, any of the following types: 'int', 'float', 'bool', 
'object', 'complex', 'str'. Note that Boolean arrays can be constructed from 
a list of integers like in the example below: 
>>> mybool_array = np.array([1, 0, 100, 0, 10], dtype='bool') 
>>> print(' My boolean list is:', mybool_array) 
My boolean list is: [ True False True False True] 
Note that any non-zero integer is considered as True and 0 is considered as 
False. You can also convert an array into a list with the method tolist() as 
follows: 
>>> mylist = mybool_array.tolist() 
>>> print (mylist) 
[True, False, True, False, True] 
To check that the array was effectively converted to a list we can type: 
>>> print (' Type of my list now is:', type(mylist)) 
Type of my list now is: <class 'list'> 
Before we jump into the attributes of the array class that are available, 
let's summarize the characteristics of ndarrays. The ndarray class allows for 
vectorized operations, unlike the list structure of Python. Once an ndarray 
is created, you cannot change its size. You need to create another array or 
overwrite the previous one to get an array of a different size. An array has 
a single data type and all its items should be of that specific data type. 
NumPy arrays generally occupy much less memory than an equivalent Python 
list. In the next sections we will present the available attributes and 
methods to process arrays in NumPy. 
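These characteristics can be observed directly. The sketch below is only an 
illustration (the arrays are made up for the purpose): it shows that 
appending produces a new array, that mixed inputs end up with one common 
type, and how much memory the items occupy: 

```python
import numpy as np

a = np.array([1, 2, 3], dtype=np.int64)

# Appending does not grow 'a' in place; np.append returns a new array.
b = np.append(a, 4)

# Mixing integers and floats yields a single common type (float64).
mixed = np.array([1, 2.5, 3])

print(a.size, b.size)   # 'a' keeps its 3 items, 'b' has 4
print(mixed.dtype)
print(a.nbytes)         # memory used by the items: 3 items * 8 bytes each
```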


Attributes of ndarray in NumPy 


In this section we will go into the details of the attributes of arrays in 
NumPy, which are shape, ndim, itemsize and flags, together with the method 
reshape(). The shape attribute provides the shape of an array. For example, 
let's get the shape of our two arrays: 
>>> myarray1 = np.array( [1,2,3] ) 

>>> myarray2 = np.array ( [[10,20,30], [4,5,6]]) 

>>> print(' The first array shape is:' , myarray1.shape) 

>>> print(' The second array shape is:', myarray2.shape) 

The first array shape is: (3,) 

The second array shape is: (2, 3) 
So, we can see from the example above that the first array myarray1 is a 
vector of 3 items while the second array is a matrix of two rows and three 
columns. We can check the dimension by using the attribute ndim. Like the 
attribute shape, the dimension of any array can be printed as given below: 

>>> myarray1 = np.array( [10,20,30] ) 

>>> myarray2 = np.array ( [[50,30,40], [80,70,600]]) 

>>> print(' The dimension of my first array is:' , myarray1.ndim) 

>>> print(' The dimension of my second array is:', myarray2.ndim) 

The dimension of my first array is: 1 

The dimension of my second array is: 2 
As given by the ndim output of the two arrays we created, the first array 
myarray1 is a 1-dimensional array and the second array myarray2 is a 
2-dimensional array. As you can see, ndim and shape are two attributes that 
provide two different pieces of information. The ndim attribute provides the 
dimension of the array, while the shape attribute provides the number of 
items in each dimension of the array. 
The reshape() method allows changing the shape of an array. Let's take for 
example the following array with shape (2, 3). In other words, the array is 
a 2-dimensional array with 2 rows and 3 columns. 
>>> myarray = np.array( [[90,200,50], [80,70,90]]) 
>>> print (myarray) 
[[ 90 200 50] [ 80 70 90]] 
>>> print (' The dimension of my array is: ', myarray.shape) 
The dimension of my array is: (2, 3) 
Now we can change the shape of the array as follows: 
>>> myreshaped_array = myarray.reshape(3,2) 
If we check the shape of the reshaped array, we get: 
>>> print (' The dimension of my reshaped array is: ', 
myreshaped_array.shape) 

The dimension of my reshaped array is: (3, 2) 


As you can conclude from the example provided, the reshape() method takes 
here 2 input arguments M and N, where M is the number of items along the 
first dimension and N is the number of items along the second dimension. Note 
that when you reshape an array, the total number of items should not change. 
Therefore, M multiplied by N should always be equal to the total number of 
items. In the example above we reshaped the array from (2,3) to (3,2) and the 
dimension remained unchanged. We can also reshape an array from a matrix 
(i.e. a multi-dimensional array) to a vector as follows: 

>>> myarray = np.array( [[70,80,60], [50,90,70]]) 

>>> b = myarray.reshape(1,6) 

>>> print (b) 


[[70 80 60 50 90 70]] 
We can now check the shape and the dimension of the new array: 

>>> print (' The dimension of my reshaped array is: ', b.ndim) 

>>> print (' The size of my reshaped array is: ', b.shape) 

The dimension of my reshaped array is: 2 

The size of my reshaped array is: (1, 6) 
Here, because we passed two arguments to the reshape() method, the dimension 
is still 2. But if we want a vector, we can do as follows: 

>>> myarray = np.array( [[90,80,70], [40,70,90]]) 

>>> b = myarray.reshape(6) 

>>> print(b) 

>>> print(' The dimension of the new array is', b.ndim) 

[90 80 70 40 70 90] 

The dimension of the new array is 1 

Now we have a vector, which is a 1-dimensional array. When we called the 
reshape() method with two input arguments, the result was an array between 2 
pairs of brackets (i.e. [[70 80 60 50 90 70]]), which means that the array is 
a multi-dimensional array. When we called the reshape() method with a single 
argument, the output was [90 80 70 40 70 90], which is a single-dimensional 
array. 
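A related convenience, shown here as a small sketch that goes slightly beyond 
the examples above, is passing -1 for one dimension so that NumPy infers it 
from the total number of items. The snippet also shows what happens when the 
requested shape does not match the number of items: 

```python
import numpy as np

m = np.array([[70, 80, 60], [50, 90, 70]])   # shape (2, 3), 6 items

flat = m.reshape(-1)        # -1 lets NumPy infer the length: shape (6,)
column = m.reshape(-1, 1)   # 6 rows and 1 column: shape (6, 1)

# A shape whose product is not 6 is rejected with a ValueError.
try:
    m.reshape(4, 2)
except ValueError as err:
    print('Cannot reshape:', err)

print(flat.shape, column.shape)
```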


The itemsize attribute provides the size of each item of the array in bytes. 
Let's see what we get when we apply this attribute to an array formed by 
8-bit integers (i.e. int8), which have a size of 1 byte: 

>>> myarray = np.array( [1,2,34,5,6], dtype=np.int8) 

>>> print(' The length of items in my array is:', myarray.itemsize) 


The length of items in my array is: 1 


If we apply this attribute to an array formed by float32 items, we get 4 
because each float32 is represented by 4 bytes. Let's see an example: 
>>> myarray = np.array( [1,2,34,5,6], dtype=np.float32) 
>>> print(' The length of items in my array is:', myarray.itemsize) 
The length of items in my array is: 4 
If we apply the itemsize attribute to the Python float type, which NumPy 
maps to float64, we get the following output: 
>>> myarray = np.array( [1,2,34,5,6], dtype=float) 
>>> print(' The length of items in my array is:', myarray.itemsize) 


The length of items in my array is: 8 


The attribute flags holds Boolean values that indicate whether the data 
stored in the array are contiguous in memory, whether the array owns its 
memory or not, whether it is writeable, and whether the data are aligned 
correctly for the hardware. 
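As a brief illustration of the flags attribute (the arrays below are made up 
for this sketch), individual flags can be read by name. A slice of an array 
is used to show a case where the array does not own its memory: 

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
view = a[:, :2]    # a slice that shares memory with 'a'

print(a.flags['C_CONTIGUOUS'])      # items stored row by row in one block
print(a.flags['OWNDATA'])           # 'a' owns its memory
print(view.flags['OWNDATA'])        # the slice borrows its memory from 'a'
print(view.flags['C_CONTIGUOUS'])   # the slice skips items in memory
print(a.flags['WRITEABLE'])         # items can be modified
```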


Methods and manipulating the ndarray in NumPy 


There are several methods and functions available to manipulate ndarrays 
with the NumPy library. In this section, we will go through the basic 
methods and functions that are crucial to know for data analysis and 
statistical analysis. This section is organized as a series of questions, 
each answered by presenting the NumPy function that manipulates the arrays 
together with examples of application. 


How to create an empty array or initialized array with 0 or 1 or from a 
range of values? 
To create an empty array, the function empty() can be used. This function 
returns an array whose values are not initialized, so it contains whatever 
happened to be in memory. Input arguments to this function are the shape and 
the type of data to be stored in the array object. For instance, we can 
create an array of 6 integer elements with shape (2,3) as follows: 

>>> my_array = np.empty([2,3],dtype = int) 

>>> print(' My array is:', my_array) 

My array is: [[-1018676784 465 0] [0 1 0]] 
Note that the values in the array are arbitrary because we did not specify 
the values to be stored in the array. 
We can create an array where all values are zeros by using the zeros() 
function. This function takes the same input arguments as the empty() 
function: the shape and the type of data. By default, float is assigned as 
the type of data. So, if the type of data is not specified, this function 
creates an array of floats. For instance, we can create an array of shape 
(2,3) as follows: 


>>> my_array = np.zeros([2,3]) 


>>> print(' My array is:', my_array) 

My array is: [[0. 0. 0.] [0. 0. 0.]] 
Now if we specify that we want an array of integers, we get the following 
result: 

>>> my_array = np.zeros([2,3], dtype=int) 

>>> print(' My array is:', my_array) 

My array is: [[0 0 0] [0 0 0]] 
Just as we created an array filled with 0, we can create an array in which 
all items are 1 by using the function ones(). Like the functions zeros() and 
empty(), the ones() function takes as input the shape and the type of the 
data, where by default the dtype is float. For example: 

>>> my_array = np.ones([2,3], dtype=int) 

>>> print(' My array is:', my_array) 

My array is: [[1 1 1] [1 1 1]] 
To create an array from a range of values, we use the function arange(). 
This function takes as input arguments the start value, the end value, the 
step and the type of data. The end value itself is not included in the array. 
For instance, to create an array with values from 1 to 9 evenly spaced by 1 
(i.e. the step is 1) we can do: 

>>> my_array = np.arange(1, 10, 1, dtype=int) 

>>> print(' My array from 1 to 9 is:', my_array) 

My array from 1 to 9 is: [1 2 3 4 5 6 7 8 9] 
In fact, by default the step is 1 and the start value is 0. So, we can create 
an array from 0 to 9 by simply typing: 

>>> my_array = np.arange(10) 

>>> print(' My array from 0 to 9 is:', my_array) 

My array from 0 to 9 is: [0 1 2 3 4 5 6 7 8 9] 
Note that, if we don't specify every argument and simply pass one argument 
to the function, this argument is considered as the end value. The function 
then creates an array using the default values, which are 0 for the start 
value and 1 for the step. 
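A few more calls, given here as an illustrative sketch with made-up variable 
names, show how the start, end and step arguments of arange() interact: 

```python
import numpy as np

evens = np.arange(0, 10, 2)       # start 0, end 10 (excluded), step 2
countdown = np.arange(5, 0, -1)   # a negative step counts downwards
one_to_ten = np.arange(1, 11)     # the end value must be 11 to include 10

print(evens)
print(countdown)
print(one_to_ten)
```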


How to access data of a ndarray in NumPy? 
The ndarrays in NumPy can be accessed like the list objects of Python. We 
can access an item by its position. Remember that indexing in Python starts 
with 0. For instance, we can get the first item of an uninitialized array 
created with empty() as follows: 

>>> my_array = np.empty(5) 

>>> print(' My array is:', my_array) 

My array is: [9.88332530e-312 2.20687562e-312 2.37663529e-312 
7.96602522e-307 1.33508845e-306] 

>>> print(' The first item in my array is:', my_array[0]) 

The first item in my array is: 9.88332530193e-312 
We can select a few items of an array by slicing. In other words, we can 
select the items that lie in a range of positions. For instance, we can 
select the first two rows of an array as follows: 

>>> my_A = np.array([[70,90],[100,70],[900,200]]) 

>>> a = my_A[0:2,] 

>>> print(' The first 2 rows are:', a) 

The first 2 rows are: [[70 90] [100 70]] 


Note that Python does not include the last index in the output of the 
slicing. Another way to access the first 2 rows and 2 columns is using the 
following indexing: 

>>> my_A = np.array([[70,90],[100,70],[900,200]]) 

>>> a =my_A[:2,:2] 


>>> print(' The result is :', a) 


The result is: [[70 90] [100 70]] 
We can get the items of the first column only by using the following 
indexing: 

>>> my_A = np.array([[70,90],[100,70],[900,200]]) 

>>> a = my_A[:,:1] 

>>> print(' Rows of the first columns are:', a) 

Rows of the first columns are: [[70] [100] [900]] 


The same concept of slicing can be applied to select the elements of a row: 
>>> my_A = np.array([[70,90],[100,70],[900,200]]) 
>>> a=my_A[:1,:] 
>>> print(' Columns of the first row are:', a) 
Columns of the first row are: [[70 90]] 
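The same indexing rules extend to negative positions and to steps. The 
sketch below is an illustration that goes slightly beyond the examples above 
and also shows one behavior worth knowing: a slice is a view on the original 
array, so writing through it modifies the original data: 

```python
import numpy as np

my_A = np.array([[70, 90], [100, 70], [900, 200]])

last_row = my_A[-1]       # negative indices count from the end
every_other = my_A[::2]   # rows 0 and 2 (step of 2)
second_col = my_A[:, 1]   # 1-dimensional array holding the second column

# A slice is a view: writing through it changes the original array.
first_rows = my_A[:2]
first_rows[0, 0] = -1

print(last_row)
print(second_col)
print(my_A[0, 0])
```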


How to detect NaN values and Inf values in ndarray in NumPy? 
It is very important to detect and handle missing values in data analysis as 
well as when computing statistics of values in arrays. In general, the 
missing values should be analyzed and removed before starting a statistical 
analysis of data. In NumPy, missing values are represented by numpy.nan and 
infinite numbers by numpy.inf, and we can detect them using the functions 
numpy.isnan() and numpy.isinf(). For example, let's create an array and 
insert a nan value: 
>>> Y = np.empty(3) 
>>> Y[0] =4 
>>> Y[1] =2 


>>> Y[2] = np.nan 


>>> print(' Array is: ', Y) 

Array is: [ 4. 2. nan] 
Now, we can check which items are missing values: 

>>> print(' Which are nan values:', np.isnan(Y)) 

Which are nan values: [False False True] 
The function numpy.isnan() outputs a Boolean value for each item of the 
array: True if the item is a missing value and False otherwise. 
Like the isnan() function, numpy.isinf() outputs a Boolean value for each 
item: True if the item is an infinite value and False otherwise. For example, 
let's create an array like we did in the previous example, this time with an 
infinite value: 

>>> Y = np.empty(3) 

>>> Y[0] =4 

>>> Y[1] =2 

>>> Y[2] = np.inf 

>>> print(' Array is: ', Y) 

Array is: [ 4. 2. inf] 

>>> print(' Which are inf values:', np.isinf(Y)) 

Which are inf values: [False False True] 
If we want to replace the missing values or the infinite values by a 
specific value, we can change those values by accessing the positions of the 
missing or infinite values. Let's for example create two arrays where one 
has a nan value and the other an inf value: 

>>> Y = np.empty(3) 

>>> Y[0] =4 

>>> Y[1] =2 

>>> Y[2] = np.nan 

>>> print(' Array is:', Y) 

Array is : [ 4. 2. nan] 


>>> Z = np.empty(3) 
>>> Z[0] = 1 
>>> Z[1] = 3 
>>> Z[2] = np.inf 
>>> print(' Array is : ', Z) 
Array is: [ 1. 3. inf] 
Now we will assign the value -9999 to the missing value and the infinite 
value: 
>>> i= np.isnan(Y) 
>>> j = np.isinf(Z) 
>>> Y[i] = -9999 
>>> Z[j] = -9999 
>>> print(' My new array Y is:', Y) 
>>> print(' My new array Z is:', Z) 
My new array Y is: [ 4. 2. -9999.] 
My new array Z is: [ 1. 3. -9999.] 
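When both kinds of values may occur in the same array, one possible 
shortcut, sketched here under the assumption that NaN and infinite values 
should be treated alike, is numpy.isfinite(), which is False for NaN, +inf 
and -inf: 

```python
import numpy as np

data = np.array([4.0, np.nan, 2.0, np.inf, -np.inf])

# np.isfinite is False for NaN, +inf and -inf alike, so '~' marks the bad items.
bad = ~np.isfinite(data)

cleaned = data.copy()    # keep the original array untouched
cleaned[bad] = -9999     # replace every non-finite item

# Alternatively, drop the non-finite items instead of replacing them.
kept = data[np.isfinite(data)]

print(cleaned)
print(kept)
```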


How to compute basics statistics of an array? 
The NumPy library offers several functions to compute the basic statistics 
of an array, which we go through in this sub-section. The maximum or the 
minimum value of an array can be computed by calling the functions max() and 
min(). These functions are also available in Python. However, in NumPy they 
have several additional uses that we are going to learn here. Let's first 
apply the basic functions: 
>>> Y = np.empty(5,dtype=int) 
>>> print(' My array is:', Y) 
My array is: [1576669984 32765 1576665760 32765 131075] 


>>> print(' The maximum is:', max(Y)) 


The maximum is: 1576669984 

>>> print(' The minimum is:', min(Y)) 

The minimum is: 32765 

>>> print(' The NumPy max is:', np.max(Y)) 

The NumPy max is: 1576669984 

>>> print(' The NumPy minimum is:', np.min(Y)) 


The NumPy minimum is: 32765 


The strength of the NumPy functions is that they allow computing the maximum 
and the minimum along any axis of a multi-dimensional array, unlike the 
built-in functions of Python. To get the idea behind these functions, let's 
create a multi-dimensional array as follows: 

>>> my_2dA = np.array( [[90,80,70], [40,70,90]]) 

>>> print(' My 2D array is:', my_2dA) 

My 2D array is: [[90 80 70] [40 70 90]] 

Now we can compute the minimum and the maximum along each dimension of our 
2D array with the functions amin() and amax(). Along the first dimension 
(axis 0) the rows are compared with each other, which gives one value per 
column; along the second dimension (axis 1) we get one value per row: 
>>> print(' The maximum along axis 0 is', np.amax(my_2dA,0)) 

The maximum along axis 0 is [90 80 90] 

>>> print(' The minimum along axis 0 is', np.amin(my_2dA,0)) 

The minimum along axis 0 is [40 70 70] 
>>> print(' The maximum along axis 1 is', np.amax(my_2dA,1)) 
The maximum along axis 1 is [90 90] 

>>> print(' The minimum along axis 1 is', np.amin(my_2dA, 1)) 
The minimum along axis 1 is [70 40] 
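Because the meaning of the axis argument is easy to mix up, here is a small 
sketch (with an illustrative array named scores) that uses the axis keyword 
explicitly. The axis that is named is the one that gets collapsed: 

```python
import numpy as np

scores = np.array([[90, 80, 70],
                   [40, 70, 90]])   # 2 rows, 3 columns

col_max = np.max(scores, axis=0)   # collapses the rows: one value per column
row_max = np.max(scores, axis=1)   # collapses the columns: one value per row
overall = np.max(scores)           # no axis: one value for the whole array

print(col_max)
print(row_max)
print(overall)
```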


Another function that is useful in NumPy is the ptp() function ("peak to 
peak"), which returns the range of the values of an array. In other words, 
it returns the difference between the maximum and the minimum values of an 
array, optionally along a certain dimension. Let's apply this function to 
our 2D array: 

>>> my_2dA = np.array( [[90,80,70], [40,70,90]]) 

>>> print(' My 2D array is:', my_2dA) 

>>> print(' Applying the ptp function gives:', np.ptp(my_2dA)) 

My 2D array is: [[90 80 70] [40 70 90]] 

Applying the ptp function gives: 50 

>>> print(' Applying the ptp function gives along the 1st dim', 
np.ptp(my_2dA,0)) 

Applying the ptp function gives along the 1st dim [50 10 20] 

>>> print(' Applying the ptp function gives along the 2nd dim', 
np.ptp(my_2dA,1)) 

Applying the ptp function gives along the 2nd dim [20 50] 


We can compute a percentile of the values in an array by using the 
percentile() function. In statistics, a percentile is the value below which 
a given percentage of the items of a set of values falls. For instance, the 
percentile 50, which is the median, is the value that divides a set of 
values into 2 equal blocks. In other words, 50% of the data are below the 
median value and 50% are above the median value. The percentile() function 
takes as input an array, the percentile to compute, given as a value between 
0 and 100, and the axis or dimension along which the function will compute 
the percentile. Now let's apply this function on our 2D array: 

>>> my_2dA = np.array( [[90,80,70], [40,70,90]]) 

>>> print(' My 2D array is:', my_2dA) 

My 2D array is: [[90 80 70] [40 70 90]] 

>>> print(' The percentile 50 along the 1st axis is:', 
np.percentile(my_2dA, 50, axis=0)) 

The percentile 50 along the 1st axis is: [65. 75. 80.] 

>>> print(' The percentile 50 along the 2nd axis is:', 
np.percentile(my_2dA, 50, axis=1)) 

The percentile 50 along the 2nd axis is: [80. 70.] 
If you are interested in computing the median, the function median() can be 
called as presented below: 

>>> print(' The median along the 1st axis is:', 
np.median(my_2dA, axis=0)) 

The median along the 1st axis is: [65. 75. 80.] 
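As a further illustration on a simple 1-dimensional array (the values are 
made up for this sketch), percentile() also accepts a list of percentiles, 
and the percentile 50 coincides with the result of median(): 

```python
import numpy as np

values = np.array([10, 20, 30, 40, 50])

# Several percentiles can be requested at once by passing a list.
q25, q50, q75 = np.percentile(values, [25, 50, 75])

print(q25, q50, q75)
print(np.median(values))   # same value as the percentile 50
```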


The mean and the weighted average of an array can be computed by calling the 
mean() and the average() functions. The difference between these two 
functions is that the first one computes the arithmetic mean, which is the 
ratio of the sum of the values to the total number of array items, while the 
second one computes the weighted average, which is the ratio of the weighted 
sum of the item values to the sum of the weights. Two arrays should be 
supplied to the average() function, where the second one holds the weight 
assigned to each item in the first array. If the second input argument, i.e. 
the weight array, is not passed to this function, it computes the arithmetic 
mean like the mean() function. 

>>> My_array = np.array([1, 2, 34, 5, 6], dtype=np.int8) 

>>> print(' My array is:', My_array) 

My array is: [ 1 2 34 5 6] 

>>> print(' The mean of my array is:', np.mean(My_array)) 

The mean of my array is: 9.6 

>>> print(' The average of my array is:', np.average(My_array)) 


The average of my array is: 9.6 


As the example above shows, the mean() function and the average() function 
provide the same result because we did not supply the average() function 
with a weight array. So, if we have weights that represent the importance of 
every element of My_array, we can compute the weighted average as follows: 

>>> weights = np.array([2,4,1,2,3]) 

>>> print(' My weights are:', weights) 

My weights are: [2 4 1 2 3] 

>>> print(' The weighted average of my array is:', 
np.average(My_array, weights=weights)) 

The weighted average of my array is: 6.0 
The average() function can also return the sum of the weights if we supply a 
third input argument, which is a Boolean named returned. If this argument is 
set to True, the function returns the sum of the weights together with the 
average. By default, this argument is set to False, so the sum of weights is 
not returned. 

>>> print(' The weighted average and sum of weights of my array is:', 
np.average(My_array, weights=weights, returned=True)) 

The weighted average and sum of weights of my array is: (6.0, 12.0) 
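To make the definition concrete, the weighted average can be recomputed by 
hand. The sketch below uses the same values and weights as the example and 
compares the manual formula with average(): 

```python
import numpy as np

values = np.array([1, 2, 34, 5, 6])
weights = np.array([2, 4, 1, 2, 3])

# Weighted average = sum(value * weight) / sum(weights) = 72 / 12
manual = np.sum(values * weights) / np.sum(weights)
builtin = np.average(values, weights=weights)

print(manual, builtin)
```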


To compute the standard deviation and the variance of an array we can call 
the two following functions std() and var(). If we apply these functions on 
the previous array we created My_array, we get: 

>>> print(' My array is:', My_array) 

>>> print (' The standard deviation of my array is:', 
np.std(My_array)) 

>>> print (' The variance of my array is:', np.var(My_array)) 

My array is: [ 1 2 34 5 6] 

The standard deviation of my array is: 12.338 


The variance of my array is: 152.239 
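By default, var() and std() divide by the number of items N (the population 
formulas). The following sketch, which goes slightly beyond the text, checks 
this against the formula written out by hand and shows the ddof argument, 
which switches to the sample variance (division by N - 1): 

```python
import math
import numpy as np

values = np.array([1, 2, 34, 5, 6])

pop_var = np.var(values)               # divides by N (default, ddof=0)
sample_var = np.var(values, ddof=1)    # divides by N - 1

# The population variance is the mean of the squared deviations from the mean.
manual_var = np.mean((values - values.mean()) ** 2)

print(pop_var, sample_var)
print(np.std(values))                  # square root of the population variance
```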


How to sort and search for specific value in ndarray in NumPy? 
In order to sort an array in NumPy we use the function sort(). This function 
sorts the items of a multi-dimensional array along the specified axis. By 
default, the sort function sorts the values along the last axis if no axis 
is specified. Let's sort the array we worked with before: 
>>> print(' My array is:', My_array) 

>>> print(' My sorted array is:', np.sort(My_array)) 

My array is: [ 1 2 34 5 6] 

My sorted array is: [ 1 2 5 6 34] 
If we have a multi-dimensional array like in the example below, we can 
specify an axis along which to sort the data as follows: 

>>> my_2darray = np.array([[1000,20,300],[400,50,600]]) 

>>> print(' My 2D array is:', my_2darray) 

>>> print(' My 2D array sorted along axis 0 is:', 
np.sort(my_2darray,0)) 

>>> print(' My 2D array sorted along axis 1 is:', 
np.sort(my_2darray, 1)) 

My 2D array is: [[1000 20 300] [ 400 50 600]] 

My 2D array sorted along axis 0 is: 

[[ 400 20 300] [1000 50 600]] 

My 2D array sorted along axis 1 is: 

[[ 20 300 1000] [ 50 400 600]] 
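Two related points can be sketched briefly (the array is illustrative): 
np.sort() returns a sorted copy and leaves the original untouched, and 
np.argsort() returns the positions that would sort the array: 

```python
import numpy as np

a = np.array([34, 1, 6, 2, 5])

sorted_copy = np.sort(a)         # new array; 'a' itself is unchanged
order = np.argsort(a)            # indices that would sort 'a'
descending = np.sort(a)[::-1]    # reverse the sorted copy for descending order

print(a)
print(sorted_copy)
print(order)
print(descending)
```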


We have learnt before in this section how to get the minimum and maximum 
values of an array, but we did not learn how to get the position of the 
minimum and the maximum values. The functions argmin() and argmax() provide 
the positions of the min and max values. 

>>> my_2darray = np.array([[1000,20,300],[400,50,600]]) 

>>> print(' My 2D array is:', my_2darray) 

>>> print(' The indices of min values of first axis are:', 
np.argmin(my_2darray,0)) 

>>> print(' The indices of min values of second axis are:', 
np.argmin(my_2darray,1)) 

My 2D array is: [[1000 20 300] [ 400 50 600]] 

The indices of min values of first axis are: [1 0 0] 

The indices of min values of second axis are: [1 1] 

>>> print(' The indices of max values of first axis are:', 
np.argmax(my_2darray,0)) 

>>> print(' The indices of max values of second axis are:', 
np.argmax(my_2darray, 1)) 

The indices of max values of first axis are: [0 1 1] 

The indices of max values of second axis are: [0 2] 


Now we will see how to search for specific values or items in an array using 
conditions. The most common search is for the non-null values in an array, 
i.e. items that are not equal to 0. To do so, NumPy offers the function 
nonzero(). This function returns the positions of the items that are not 
equal to 0, as one array of row indices and one array of column indices. For 
example: 
>>> my_2darray = np.array([[1000,0,300],[400,50,0]]) 
>>> print(' My 2D array is:', my_2darray) 
>>> print(' Position of non-zero items:', np.nonzero(my_2darray)) 
My 2D array is: [[1000 0 300] [ 400 50 0]] 

Position of non-zero items: (array([0, 0, 1, 1], dtype=int64), 
array([0, 2, 0, 1], dtype=int64)) 


If you are searching for items in an array that are greater or lower than a 
certain value, the function where() can be used. For instance, let's get the 
position of all items that are greater than 50 in our 2D array: 

>>> my_2darray = np.array([[9, 90,300],[400,50,0]]) 

>>> print(' My 2D array is:', my_2darray) 

>>> i = np.where(my_2darray>50) 

>>> print(' The position of the items > 50 are: ', i) 

My 2D array is: [[9 90 300] [ 400 50 0]] 

The position of the items > 50 are: (array([0, 0, 1], dtype=int64), 
array([1, 2, 0], dtype=int64)) 


Now we can access the items with values greater than 50 through the indices 
stored in the variable i and change those values to 1, for example: 

>>> my_2darray[i]=1 

>>> print(' My new array is:', my_2darray) 

My new array is: [[ 9 1 1] [ 1 50 0]] 


More generally, we can search for items that satisfy a condition by using 
the function extract(). To use this function, the condition must be defined 
first. Let's search for the items that have a value greater than 50 in our 
2D array, but this time with the extract() function. 

>>> my_2darray = np.array([[1000,0,300],[400,50,0]]) 

>>> print(' My 2D array is:', my_2darray) 

My 2D array is: [[1000 0 300] [ 400 50 0]] 

We define our condition my_2darray > 50: 

>>> cond = my_2darray>50 

>>> i = np.extract(cond,my_2darray) 

>>> print(' Items with values >50 are:', i) 
Items with values >50 are: [1000 300 400] 

Note that the extract() function does not return the positions but the items 
themselves that satisfy the condition. 
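For comparison, the same selection can be written with Boolean indexing, and 
where() also has a three-argument form that builds a new array instead of 
returning positions. This is a short sketch using the array of the example: 

```python
import numpy as np

my_2darray = np.array([[1000, 0, 300], [400, 50, 0]])
cond = my_2darray > 50

by_extract = np.extract(cond, my_2darray)   # items that satisfy the condition
by_mask = my_2darray[cond]                  # Boolean indexing gives the same items

# Three-argument where(): take 1 where cond is True, the original item otherwise.
capped = np.where(cond, 1, my_2darray)

print(by_extract)
print(by_mask)
print(capped)
```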


Chapter 6: Pandas intensive course 


In this chapter we will explore how we can use the Pandas library in Python. 
As we mentioned in the previous chapters, the Pandas library uses 
functionalities of the NumPy library. Before diving into the details and 
functions of the Pandas library, let's see what the Pandas library is and 
what it is used for. 


Python Pandas library 


The package Pandas is an open source package that is available in Python and 
provides efficient data structures to use for data analysis. The name Pandas 
is derived from "panel data" and is also read as Python Data Analysis 
Library. It is a high-performance and efficient library for data analysis. 
This package is used for different applications including analytics, 
statistics and finance or economics. This library is particularly useful 
because it provides powerful tools for data preparation and cleansing, 
importing and exporting data in csv or text format, inserting and joining 
data, time series manipulation and more. The strength of the Pandas library 
comes from the Data Frame object that is defined in this library. A Data 
Frame is simply a kind of indexed table with rows and columns that can be 
read from a text file or csv file or can be created from a list in Python. A 
Data Frame can hold data of different types. Another object defined in the 
Pandas package is the Series. A Series is a 1-dimensional array formed with 
the same type of data. The third object defined in Pandas is the Panel, 
which is a 3-dimensional array (Panel exists only in older versions of 
Pandas; it was removed in version 0.25). These tools are much easier to 
handle and manipulate compared to the basic objects of Python like lists or 
dictionaries. To summarize, in Pandas we can save data in a Series if we 
have a homogeneous 1D array formed by the same type of data, in a Data Frame 
if we have a heterogeneous 2D array, which can be formed from a set of 
Series, and finally in a Panel, which can be formed from several Data 
Frames. 

Before we get into the functions of the Pandas library, you have to install 
it first. You can do that through Anaconda by typing in a command line: 

conda install -c anaconda pandas 


If you are using a Jupyter Notebook presented in chapter 4, you can install 
Pandas by typing in a cell: 

pip install pandas 
The Pandas library depends on other libraries: NumPy, which is the most 
important one, and matplotlib for creating figures, which we are going to 
see in the next chapter. By installing Pandas from a package, you get all 
the dependencies by default. To be able to use the Pandas library, you 
should first import it into your working environment by typing: 

import numpy as np 


import pandas as pd 


Note that we imported the NumPy library too because the Pandas library 
depends on NumPy. It is good to get into the habit of importing both 
libraries when doing data analysis. 


Data structures in Pandas 


We mentioned in the previous section that Pandas offers three types of data 
structures: Series, Data Frame and Panel. These structures are mutable, 
which means that their values can be changed. A Series is a 
single-dimensional array that holds data of the same type. The values of a 
Series can be changed but not its size. That is, once a Series of size N is 
defined, only its values can be changed. The Data Frame object, which is the 
most used data structure, uses rows and columns instead of the notion of 
axis, unlike the arrays of the NumPy library. This data structure is a 2D 
array that can hold different types of data. Both the size and the values of 
the data stored in a Data Frame can be changed after the Data Frame is 
created. Panels are used much less than Series and Data Frames. They have 
the same characteristics as the Data Frame data structure. In fact, Panels 
can be viewed as a set of Data Frames. Because Series and Data Frames are 
the most used data structures, we will focus on these two in this chapter. 
Now let's see how we can define and create these two data structures. 


How to define and create Series in Pandas? 

Different methods exist to define a Series in Pandas. The simplest is to
call the constructor Series() in order to create a Series data object. This
constructor takes as input arguments data, which can be an ndarray, a list
or a dictionary; index, which should be formed by unique values and have
the same size as the data object; dtype, which is the type of the data; and
copy, a Boolean argument that controls whether the input data is copied
(False by default in older versions of Pandas). We can create an empty
Series using the Series() function without supplying any argument as
follows:


>>> mySerie = pd.Series() 


>>> print (' My Serie is:', mySerie) 

My Serie is: Series([], dtype: float64) 
We can create a Series from an ndarray as follows:

>>> my_A = np.array([100,200,300,400,500])

>>> mySerie = pd.Series(my_A)

>>> print(' My Serie from ndarray is: \n', mySerie)

My Serie from ndarray is:

0    100
1    200
2    300
3    400
4    500
dtype: int32
Note here that we did not supply an index to create the Series. By default,
this function creates an index equivalent to range(N), where N is the
number of elements. We can also create a Series from a dictionary. If we
don’t supply an index to the Series() function, it will use the keys of the
dictionary as the index. Let’s see an example with and without passing an
index to the Series() function:

>>> my_D = {'A':100, 'B':200, 'C':300, 'D':400}

>>> mySeries = pd.Series(my_D)

>>> print(' My series from dictionary without index is:\n', mySeries)

My series from dictionary without index is:

A    100
B    200
C    300
D    400
dtype: int64

>>> ind = ['A','C','B','D']

>>> mySeries2 = pd.Series(my_D, index=ind)

>>> print(' My series from dictionary with index is:\n', mySeries2)

My series from dictionary with index is:

A    100
C    300
B    200
D    400
dtype: int64


Note that when we did not supply the index, the function used the keys of 
the dictionary in order. In the other case when the index is supplied the 
function ordered the items of the dictionary according to the order of the 
index. 
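
One case the paragraph above does not cover is an index label that has no
matching key in the dictionary. The following is a minimal sketch, using
hypothetical data that is not part of the examples of this chapter, of what
Pandas does in that situation: the unmatched label is kept and its value is
set to NaN.

```python
import pandas as pd

# Hypothetical data, chosen only for illustration
prices = {'A': 100, 'B': 200, 'C': 300}
labels = ['C', 'A', 'E']          # 'E' is not a key of the dictionary

# Values are looked up by label, in the order given by the index
s = pd.Series(prices, index=labels)
print(s)

# 'E' has no matching key, so its value is NaN (a float),
# which is why the dtype of the whole Series becomes float64
print(s.isna())
```

Keys of the dictionary that are not listed in the index ('B' here) are
simply left out of the resulting Series.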
We can initialize a Series with a specific value by providing this value to
the function Series() together with an index, as follows:

>>> mySeries = pd.Series(1, index=[100,200,300,400])

>>> print(' My series initialized with 1 is:\n', mySeries)

My series initialized with 1 is:

100    1
200    1
300    1
400    1
dtype: int64
Elements stored in a Series can be accessed by position like the ndarrays
of NumPy. For instance, if we want the first element, we access it as
follows:

>>> mySeries = pd.Series([100,200,300,400])

>>> print(' My Series is:\n', mySeries)
My Series is:
0    100
1    200
2    300
3    400
dtype: int64
>>> print('The first element in my Series is:', mySeries[0])
The first element in my Series is: 100
We can apply slicing to get a range of positions in a Series:
>>> mySeries = pd.Series([100,200,300,400])
>>> print('The first two elements in my Series are:\n', mySeries[0:2])
The first two elements in my Series are:
0    100
1    200
dtype: int64


Or we can get the last two elements by applying the slicing presented below:
>>> mySeries = pd.Series([100,200,300,400])
>>> print('The last two elements in my Series are:\n', mySeries[-2:])
The last two elements in my Series are:
2    300
3    400
dtype: int64


Another way to access elements in a Series is by the labels of the index,
if the data is labeled. For example, if we have the following Series:
>>> mySeries = pd.Series([100,200,300,400], index=['X','Y','Z','P'])

>>> print(' My labeled Series is:\n', mySeries)

My labeled Series is:

X    100
Y    200
Z    300
P    400
dtype: int64
Now we can get the first element by its label as follows:

>>> print(' The first element of my labeled data is:', mySeries['X'])

The first element of my labeled data is: 100
If we want to get multiple elements using labels, we just supply the list
of labels as follows:

>>> print(' The first three elements of my labeled data are:\n',
mySeries[['X','Y','Z']])

The first three elements of my labeled data are:

X    100
Y    200
Z    300
dtype: int64
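
The square-bracket access shown above accepts both positions and labels. As
a complement, the short sketch below uses the explicit accessors iloc
(position) and loc (label) that Pandas also provides. The variable names
are illustrative only and are not taken from the examples above.

```python
import pandas as pd

s = pd.Series([100, 200, 300, 400], index=['X', 'Y', 'Z', 'P'])

first_by_position = s.iloc[0]    # element at position 0
first_by_label = s.loc['X']      # element whose label is 'X'

# Position slices exclude the end, like Python lists
last_two = s.iloc[-2:]

# Label slices include BOTH ends: 'X', 'Y' and 'Z' are returned
x_to_z = s.loc['X':'Z']

print(first_by_position, first_by_label)
print(last_two)
print(x_to_z)
```

Using iloc and loc makes the intent explicit, which helps when the labels
of a Series are themselves integers.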


How to define and create a Data Frame in Pandas? 

The Data Frame structure is a 2D data type with a tabular format. This data
structure can be defined and created in the same fashion as a Series, with
the exception that a Data Frame has an index for columns too. Let’s say for
example we have data of patients as follows:


Example of data

Name     Gender   Age   Smoker
Alec     male     29    no
James    male     30    no
Mark     male     40    yes
Silvia   female   35    no
Helene   female   30    no

This set of data is stored in the same format in a Data Frame data
structure. To create a Data Frame, the DataFrame() function can be used.
This function takes the following inputs: data, index to define the labels
of rows, columns to define the labels of columns, and dtype for the type of
data saved in the columns. The input argument data can be a list, a
dictionary, an ndarray or another DataFrame. Now let’s see the different
ways we can create a DataFrame for the example data given in the table
above. First, we can create an empty DataFrame by not supplying any input
to the DataFrame() function:

>>> myDataFrame = pd.DataFrame()

>>> print(' My empty DataFrame is:', myDataFrame)

My empty DataFrame is: Empty DataFrame
Columns: []
Index: []


We can define a DataFrame for the age variable only, from a 1D list, as
follows:

>>> Age = [29,30,40,35,30]

>>> myAgeDataframe = pd.DataFrame(Age)

>>> print(' My Age DataFrame is:\n', myAgeDataframe)

My Age DataFrame is:

    0
0  29
1  30
2  40
3  35
4  30
Note here we did not supply indexing for the columns or rows. The function 
uses by default range of number of rows and columns as indexing. So, in 
this example we have only one column that was indexed as 0 and rows 
indexed from 0 to 4. In the next example we will define a DataFrame for 
the first two columns of our data using indexing for columns: 

>>> mydata = [['Alec','male'], ['James','male'], ['Mark','male'],
['Silvia','female'], ['Helene','female']]

>>> myDataFrame = pd.DataFrame(mydata, columns=['Name','Gender'])

>>> print(' My indexed DataFrame is:\n', myDataFrame)

My indexed DataFrame is:

     Name  Gender
0    Alec    male
1   James    male
2    Mark    male
3  Silvia  female
4  Helene  female
We can also define our data as a dictionary and create a DataFrame from
this dictionary as follows:

>>> myDictdata = {'Name': ['Alec','James','Mark','Silvia','Helene'],
'Gender': ['male','male','male','female','female']}

>>> myDataFrame = pd.DataFrame(myDictdata)

>>> print(' My DataFrame from dictionary is:\n', myDataFrame)

My DataFrame from dictionary is:

     Name  Gender
0    Alec    male
1   James    male
2    Mark    male
3  Silvia  female
4  Helene  female
Note here that the DataFrame used the keys of the dictionary, ‘Name’ and
‘Gender’, as labels for the columns. We can change the indexing of the
DataFrame rows by supplying index labels for the rows:

>>> myDictdata = {'Name': ['Alec','James','Mark','Silvia','Helene'],
'Gender': ['male','male','male','female','female']}

>>> myDataFrame = pd.DataFrame(myDictdata,
index=['X','Y','Z','J','I'])

>>> print(' My indexed DataFrame from dictionary is:\n', myDataFrame)

My indexed DataFrame from dictionary is:

     Name  Gender
X    Alec    male
Y   James    male
Z    Mark    male
J  Silvia  female
I  Helene  female



We can also create a DataFrame from pandas Series as follows:

>>> mydictSeries = {'Age': pd.Series([29,39,4,35,30]),
'Name': pd.Series(['Alec','James','Mark','Silvia','Helene'])}

>>> myDataFrame = pd.DataFrame(mydictSeries)

>>> print(' My DataFrame from a dictionary of Series is:\n',
myDataFrame)


In this example we created a dictionary of pandas Series, then we defined
a DataFrame from this dictionary. Now let’s create a DataFrame for the data
presented in the table above and see the different ways to select and
extract data from a DataFrame and the operations we can do on a DataFrame.

>>> name = ['Alec','James','Mark','Silvia','Helene']

>>> Gender = ['male','male','male','female','female']

>>> Age = [29,39,4,35,30]

>>> Smoker = ['no', 'no', 'yes', 'no', 'no']

>>> mydict = {'Name': name, 'Gender': Gender, 'Age': Age,
'Smoker': Smoker}

>>> mydata = pd.DataFrame(mydict)

>>> print(' My data in a DataFrame is:\n', mydata)

My data in a DataFrame is:

     Name  Gender  Age Smoker
0    Alec    male   29     no
1   James    male   39     no
2    Mark    male    4    yes
3  Silvia  female   35     no
4  Helene  female   30     no



Now that we have created our DataFrame, we can select a column from it by
using the column label as an index:

>>> print('Selected Name column from my DataFrame is:\n',
mydata['Name'])

Selected Name column from my DataFrame is:

0      Alec
1     James
2      Mark
3    Silvia
4    Helene
Name: Name, dtype: object


To select rows, we can use slicing like we did with Pandas Series. For
example, let’s select the first two rows:
>>> print('The first two rows of my DataFrame are:\n', mydata[0:2])
The first two rows of my DataFrame are:
    Name Gender  Age Smoker
0   Alec   male   29     no
1  James   male   39     no
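
Beyond selecting one column or a slice of rows, it is often useful to
select several columns at once or a single cell. The following sketch
illustrates this on a hypothetical three-row table; the names df, name_age,
second_row and mark_age are illustrative and do not appear in the examples
above.

```python
import pandas as pd

# Small hypothetical table, used only for illustration
df = pd.DataFrame({'Name': ['Alec', 'James', 'Mark'],
                   'Gender': ['male', 'male', 'male'],
                   'Age': [29, 30, 40]})

# A list of column labels returns a new DataFrame with those columns
name_age = df[['Name', 'Age']]

# iloc selects rows by position; one row comes back as a Series
second_row = df.iloc[1]

# loc takes a row label and a column label to reach a single cell
mark_age = df.loc[2, 'Age']

# Position slice on rows: rows 0 and 1 (the end is excluded)
first_two = df.iloc[0:2]

print(name_age)
print(second_row)
print(mark_age)
```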


The DataFrame data structure is size mutable, which means we can change its
size after it is defined. Now let’s see how we can add or delete columns or
rows from a DataFrame.


How to delete or add columns/rows in a DataFrame?
We can delete a column from the DataFrame by using the Python statement
del as follows:
>>> del mydata['Smoker']

>>> print(' My new DataFrame is:\n', mydata)

My new DataFrame is:

     Name  Gender  Age
0    Alec    male   29
1   James    male   39
2    Mark    male    4
3  Silvia  female   35
4  Helene  female   30


In the above example, we deleted the column ‘Smoker’ from the DataFrame.
The function pop() is another method among Pandas functions to delete a
column from a DataFrame. It removes the column and returns it:

>>> gender = mydata.pop('Gender')

>>> print(' My new DataFrame is:\n', mydata)

My new DataFrame is:

     Name  Age
0    Alec   29
1   James   39
2    Mark    4
3  Silvia   35
4  Helene   30


We can add columns to a DataFrame by supplying a Series. Let’s add back
the gender column to our DataFrame by passing it as a Series:
>>> mydata['Gender'] = pd.Series(['male','male','male','female','female'])

>>> print(' The DataFrame is now:\n', mydata)

The DataFrame is now:

     Name  Age  Gender
0    Alec   29    male
1   James   39    male
2    Mark    4    male
3  Silvia   35  female
4  Helene   30  female


To delete a row from a DataFrame, the function drop() can be used. Let’s
see how we can delete the last row of our DataFrame:

>>> mynewdata = mydata.drop(4)

>>> print(' My new DataFrame without the last row is:\n', mynewdata)

My new DataFrame without the last row is:

     Name  Age  Gender
0    Alec   29    male
1   James   39    male
2    Mark    4    male
3  Silvia   35  female
The drop() function uses the row labels to delete rows and returns a new
DataFrame. In order to add rows, older versions of Pandas offered the
function append(). It was removed in Pandas 2.0, so the example below uses
the function pd.concat() instead. Let’s add back the last row we just
deleted:

>>> mydict = {'Name': 'Helene', 'Age': 30, 'Gender': 'female'}

>>> lastrow = pd.DataFrame(mydict, index=[4])

>>> mynewdata = pd.concat([mynewdata, lastrow])

>>> print('The DataFrame with the last row is:\n', mynewdata)

The DataFrame with the last row is:

     Name  Age  Gender
0    Alec   29    male
1   James   39    male
2    Mark    4    male
3  Silvia   35  female
4  Helene   30  female



Now that we have learnt how to create Series and Data Frames with Pandas,
we are going to learn in the next section the different functions and
attributes of these data structures.


3 Attributes of Series and DataFrame in Pandas 


There are several attributes and methods related to Series and DataFrame
that make it easy to explore and view the characteristics of the data in
these data structures. We will start first with the Series data structure.
A Series in the Pandas library has the following attributes: 1) axes, which
provides the list of the row labels of a Series, 2) dtype, which provides
the type of the objects forming the Series, 3) empty, which is a Boolean
that indicates if the Series is empty (i.e. True if empty, False
otherwise), 4) ndim, which provides the dimension of the data and is always
1 for a Series, 5) size, which provides the number of items in the Series,
6) values, which provides the elements of the Series as an ndarray. Two
methods are also very useful: 7) head() lists the first n rows of the
Series, by default the first 5 rows, and 8) tail() lists the last n rows of
the Series, by default the last 5 rows.
In order to get familiar with these functionalities, we will load the Iris
dataset and practice the examples. The Iris dataset is a widely used sample
data set in machine learning examples. We will use it in this intensive
course to learn how to manipulate Series and Data Frames.
The Iris dataset is formed by samples of 3 species of the Iris flower. Each
sample is described by the length and width of its sepals and petals. This
dataset is available in the sklearn library, which is designed for machine
learning. You will need to install this library as we did for the Pandas
and the NumPy libraries. Then import the library and the dataset as
follows:

>>> import numpy as np 

>>> import pandas as pd 

>>> from sklearn import datasets 


>>> Iris = datasets.load_iris() 


The Iris dataset is stored as a dictionary-like object. We can check that
with the following command:
>>> print(' The type of the Iris dataset is:', type(Iris))
The type of the Iris dataset is: <class 'sklearn.utils.Bunch'>
We have to retrieve the data from the Iris dataset as follows:
>>> Iris_feat = Iris.data
>>> print('Now the Iris data is as:', type(Iris_feat))
Now the Iris data is as: <class 'numpy.ndarray'>


Now we are going to create a DataFrame from the Iris data as we have
learnt in the previous section:
>>> Iris_DF = pd.DataFrame(Iris_feat, columns=Iris.feature_names)
We have created a DataFrame from the Iris data using the Iris feature names
to label the columns. Now that you are all set and have a sample dataset to
practice with, let’s retrieve the second column (the sepal width) and
explore the Series attributes. Remember that to select a column we can use
its label as an index, as we learnt in the previous section:
>>> C1 = Iris_DF['sepal width (cm)']
The above command returns the second column of the DataFrame as a Series.
We can also pass it explicitly to the Series() function:
>>> MySeries = pd.Series(C1)
Now we will apply the attributes we listed:
>>> # Checking if the Series is empty with the empty attribute
>>> print(' My Series is empty:', MySeries.empty)
My Series is empty: False
>>> # The list of row labels with the attribute axes
>>> print(' The axes of my Series are:', MySeries.axes)
The axes of my Series are: [RangeIndex(start=0, stop=150, step=1)]
>>> # The dimension with the attribute ndim
>>> print(' The dimension of my Series is:', MySeries.ndim)
The dimension of my Series is: 1
>>> # The type of the data with the dtype attribute
>>> print(' The dtype of my Series is:', MySeries.dtype)
The dtype of my Series is: float64
>>> # Values with the attribute values
>>> V = MySeries.values
>>> print('The 5 first values of my Series are:', V[0:5])
The 5 first values of my Series are: [3.5 3.  3.2 3.1 3.6]
>>> # Getting the 5 first rows of the Series with the head() method
>>> MySeries.head()
0    3.5
1    3.0
2    3.2
3    3.1
4    3.6
Name: sepal width (cm), dtype: float64
>>> # Getting the last 5 rows of the Series with the tail() method
>>> MySeries.tail()
145    3.0
146    2.5
147    3.0
148    3.4
149    3.0
Name: sepal width (cm), dtype: float64


In addition to the attributes that we have presented for the Series data
structure, the DataFrame data structure has some attributes to explore not
only rows but also columns. These additional attributes are: 1) T, to
transpose columns and rows, 2) shape, which provides the dimensions of the
DataFrame. Now let’s explore the entire DataFrame we created from the Iris
dataset.
>>> # Check if the DataFrame is empty
>>> print(' My DataFrame is empty:', Iris_DF.empty)
My DataFrame is empty: False
>>> # Get the dimension of the DataFrame with the ndim attribute
>>> print(' The dimension of my DataFrame is:', Iris_DF.ndim)
The dimension of my DataFrame is: 2
>>> # Get the size of the DataFrame with the size attribute
>>> print(' The size of my DataFrame is:', Iris_DF.size)
The size of my DataFrame is: 600
>>> # Get the shape of the DataFrame with the shape attribute
>>> print(' The shape of my DataFrame is:', Iris_DF.shape)
The shape of my DataFrame is: (150, 4)
>>> # Get the values as an ndarray from the DataFrame
>>> V = Iris_DF.values
>>> print('The 5 first values of my DataFrame are:\n', V[0:5,])
The 5 first values of my DataFrame are:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
>>> # Get the 5 first rows with the head() method
>>> Iris_DF.head()
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2
>>> # Get the 5 last rows with the tail() method
>>> Iris_DF.tail()
     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
145                6.7               3.0                5.2               2.3
146                6.3               2.5                5.0               1.9
147                6.5               3.0                5.2               2.0
148                6.2               3.4                5.4               2.3
149                5.9               3.0                5.1               1.8
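
The attributes listed above can also be tried on a very small DataFrame,
where the results are easy to check by hand. The sketch below uses a
hypothetical three-row table; the names small_df and transposed are
illustrative only.

```python
import pandas as pd

# Hypothetical table with 3 rows and 2 columns
small_df = pd.DataFrame({'Name': ['Alec', 'James', 'Mark'],
                         'Age': [29, 30, 40]})

print(small_df.shape)   # (rows, columns)
print(small_df.size)    # rows * columns
print(small_df.ndim)    # a DataFrame always has 2 dimensions

# T swaps rows and columns: the column labels become the row labels
transposed = small_df.T
print(transposed)
print(transposed.shape)
```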


1 Functions to compute statistics and manipulate
DataFrame in Pandas


This section concentrates on the DataFrame data structure. All examples
and applications will be presented using the Iris data set we presented in
the previous section. So, we will start by importing and saving the dataset
in a DataFrame. Then the section will be presented as a series of questions
that we will answer by providing the functions to use and how to use them.

>>> import pandas as pd

>>> import numpy as np

>>> from sklearn import datasets

>>> Iris = datasets.load_iris()

>>> Iris_data = pd.DataFrame(Iris.data, columns=Iris.feature_names)


How to detect NaN values and Inf values in a DataFrame in Pandas?
Inspecting missing values in data is typically the first step in data
analysis and statistical analysis in general. To check for missing values
in a DataFrame in pandas, the function pd.isnull() can be used. When it is
given a DataFrame, this function outputs a Boolean DataFrame that contains
True where a value is missing and False otherwise. Now let’s check if the
Iris dataset has any missing values:

>>> missing_val = pd.isnull(Iris_data)

>>> missing_val.head()

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0              False             False              False             False
1              False             False              False             False
2              False             False              False             False
3              False             False              False             False
4              False             False              False             False


This function typically returns an object similar to the DataFrame
supplied, where each row/column is a Boolean vector that indicates whether
a value is missing or not.
A simple way to inspect if there are any missing values is to get the
number of elements in the DataFrame with missing values. The expression
pd.isnull().sum() is useful to detect missing values:

>>> pd.isnull(Iris_data).sum()

sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
dtype: int64
As can be noticed from the results above, pd.isnull().sum() outputs the
number of missing values for each column separately. In this example, the
Iris dataset does not have any missing values.
In case there are missing data in the dataset, the rows containing them
can be deleted using the function DataFrame.dropna(), or the missing
values can be replaced with a specific value X with the function
DataFrame.fillna(X).
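
Since the Iris dataset has no missing values, the functions above have
nothing to act on. The following minimal sketch applies them to a small
hypothetical DataFrame that contains one NaN and one Inf value. The use of
DataFrame.isin() to find infinite values is one possible approach chosen
for this illustration; the text above does not prescribe a method for Inf.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value and one infinite value
df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [4.0, 5.0, np.inf]})

# Number of missing values per column (Inf is NOT counted as missing)
n_missing = df.isnull().sum()

# Number of infinite values per column
n_inf = df.isin([np.inf, -np.inf]).sum()

# Drop every row that contains at least one missing value
dropped = df.dropna()

# Replace every missing value with 0
filled = df.fillna(0)

print(n_missing)
print(n_inf)
print(dropped)
print(filled)
```

Note that dropna() and fillna() return new DataFrames here and leave df
itself unchanged.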
How to compute basic statistics of DataFrame data in Pandas?
To compute basic statistics of the data in a DataFrame, the following
functions can be used. These functions return the statistics of each
column separately:

>>> # Get the mean of the DataFrame with the function DataFrame.mean()
>>> print('Iris sample data Average Features is:\n', Iris_data.mean())

Iris sample data Average Features is:

sepal length (cm)    5.843333
sepal width (cm)     3.057333
petal length (cm)    3.758000
petal width (cm)     1.199333
dtype: float64

>>> # Get the number of non-null values with the function DataFrame.count()

>>> print(' The number of non-null values of the Features of the Iris
sample data is:\n', Iris_data.count())

The number of non-null values of the Features of the Iris sample data is:

sepal length (cm)    150
sepal width (cm)     150
petal length (cm)    150
petal width (cm)     150
dtype: int64

>>> # Get the standard deviation of the DataFrame with the function
DataFrame.std()

>>> print(' The standard deviation of the Features of the Iris sample
data is:\n', Iris_data.std())

The standard deviation of the Features of the Iris sample data is:

sepal length (cm)    0.828066
sepal width (cm)     0.435866
petal length (cm)    1.765298
petal width (cm)     0.762238
dtype: float64

>>> # Get the max value of the DataFrame with the function DataFrame.max()

>>> print(' The max values of the Features of the Iris sample data are:\n',
Iris_data.max())
The max values of the Features of the Iris sample data are:
sepal length (cm)    7.9
sepal width (cm)     4.4
petal length (cm)    6.9
petal width (cm)     2.5
dtype: float64
>>> # Get the min value of the DataFrame with the function DataFrame.min()
>>> print(' The min values of the Features of the Iris sample data are:\n',
Iris_data.min())
The min values of the Features of the Iris sample data are:
sepal length (cm)    4.3
sepal width (cm)     2.0
petal length (cm)    1.0
petal width (cm)     0.1
dtype: float64
>>> # Get the median value of the DataFrame with the function
DataFrame.median()
>>> print(' The median values of Iris sample data features are:\n',
Iris_data.median())
The median values of Iris sample data features are:
sepal length (cm)    5.80
sepal width (cm)     3.00
petal length (cm)    4.35
petal width (cm)     1.30
dtype: float64


>>> # Get the correlation between the columns of a DataFrame with
the function DataFrame.corr()

>>> print(' The correlation between the Features of the Iris sample
data is:\n', Iris_data.corr())

The correlation between the Features of the Iris sample data is:

                   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
sepal length (cm)           1.000000         -0.117570           0.871754          0.817941
sepal width (cm)           -0.117570          1.000000          -0.428440         -0.366126
petal length (cm)           0.871754         -0.428440           1.000000          0.962865
petal width (cm)            0.817941         -0.366126           0.962865          1.000000


The last function DataFrame.corr() returns the correlation matrix between
the columns of the DataFrame. We can also compute the mode with the
function DataFrame.mode(), the cumulative sum with DataFrame.cumsum(), get
the absolute values with the DataFrame.abs() function, compute the product
of values with the DataFrame.prod() function, and compute the cumulative
product with the DataFrame.cumprod() function.
In order to get the summary and descriptive statistics of a DataFrame, you
can simply use the function DataFrame.describe() as follows:


>>> Iris_data.describe()

       sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
count         150                150               150                150
mean            5.843333           3.057333          3.758              1.199333
std             0.828066           0.435866          1.765298           0.762238
min             4.3                2                 1                  0.1
25%             5.1                2.8               1.6                0.3
50%             5.8                3                 4.35               1.3
75%             6.4                3.3               5.1                1.8
max             7.9                4.4               6.9                2.5

How to filter/sort and groupby data in a DataFrame using Pandas?
A DataFrame data structure can be sorted using the function
DataFrame.sort_values(). This function takes as input arguments the
column by which to sort the DataFrame and the direction of sorting,
ascending or descending. Let’s for example sort the Iris DataFrame
according to the first column in ascending order:

>>> sorted_data = Iris_data.sort_values('sepal length (cm)',
ascending=True)


>>> sorted_data.head()

    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
13                4.3               3.0                1.1               0.1
42                4.4               3.2                1.3               0.2
38                4.4               3.0                1.3               0.2
8                 4.4               2.9                1.4               0.2
41                4.5               2.3                1.3               0.3


We can also sort the data according to several columns in different
directions. For instance, let’s sort the data in ascending order according
to the first column and in descending order according to the second column:


>>> sorted_data = Iris_data.sort_values(['sepal length (cm)','sepal
width (cm)'], ascending=[True,False])


>>> sorted_data.head()

    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
13                4.3               3.0                1.1               0.1
42                4.4               3.2                1.3               0.2
38                4.4               3.0                1.3               0.2
8                 4.4               2.9                1.4               0.2
41                4.5               2.3                1.3               0.3


A DataFrame data structure can be filtered according to conditions. If
several conditions are to be combined, the operators & (and) and | (or)
are used, with each condition enclosed in parentheses. For instance, let’s
filter the Iris dataset according to the sepal length. Let’s select the
Iris flowers with a sepal length greater than 5 cm:
>>> filtered_data = Iris_data[Iris_data['sepal length (cm)'] > 5]

>>> print('The number of Iris flowers with sepal length > 5 is:',
filtered_data.shape)

The number of Iris flowers with sepal length > 5 is: (118, 4)


Now let’s see an example with multiple conditions. Let’s filter the Iris
flowers such that the sepal length is greater than 5 cm and less than 6 cm:
>>> filtered_data = Iris_data[(Iris_data['sepal length (cm)'] > 5) &
(Iris_data['sepal length (cm)'] < 6)]
>>> print('The number of Iris flowers with sepal length between
5 cm and 6 cm is:', filtered_data.shape)
The number of Iris flowers with sepal length between 5 cm and 6 cm is: (51, 4)
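
The example above combines two conditions with &. The sketch below, on a
small hypothetical DataFrame whose column names length and width are
illustrative only, also shows the | (or) and ~ (not) operators. The
parentheses around each condition are needed because these operators bind
more tightly than comparisons in Python.

```python
import pandas as pd

# Hypothetical measurements, used only for illustration
df = pd.DataFrame({'length': [4.9, 5.1, 5.8, 6.3, 7.0],
                   'width':  [3.0, 3.5, 2.7, 3.3, 3.2]})

# OR: rows whose length is below 5 or above 6
either = df[(df['length'] < 5) | (df['length'] > 6)]

# NOT: rows whose width is not greater than 3.2
not_wide = df[~(df['width'] > 3.2)]

# AND: rows whose length is above 5 and whose width is above 3
both = df[(df['length'] > 5) & (df['width'] > 3)]

print(either)
print(not_wide)
print(both)
```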


The DataFrame data structure can be split into groups according to a
specific criterion. To do so, the DataFrame.groupby() function is
available in Pandas. This function takes as input argument the column(s)
according to which the DataFrame is to be grouped. For example, let’s group
the Iris data set by the sepal length:

>>> grouped_data = Iris_data.groupby('sepal length (cm)')
Now to access a group X in the grouped data we can use the function
get_group(X). In our example, we can get the group of Iris flowers having
a sepal length equal to 5.1 cm with the following command line:

>>> X = grouped_data.get_group(5.1)

>>> print(' The group of Iris flowers with sepal length equal to 5.1
cm is:\n', X)

The group of Iris flowers with sepal length equal to 5.1 cm is:

    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                 5.1               3.5                1.4               0.2
17                5.1               3.5                1.4               0.3
19                5.1               3.8                1.5               0.3
21                5.1               3.7                1.5               0.4
23                5.1               3.3                1.7               0.5
39                5.1               3.4                1.5               0.2
44                5.1               3.8                1.9               0.4
46                5.1               3.8                1.6               0.2
98                5.1               2.5                3.0               1.1


We can explore or iterate through the groups of the data frame as follows:
>>> for i, g in grouped_data:
...     print(i)
...     print(g)
For convenience, and because of the high number of groups, we present
below only a selected part of the output of the loop above:

4.3
    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
13                4.3               3.0                1.1               0.1
4.4
    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
8                 4.4               2.9                1.4               0.2
38                4.4               3.0                1.3               0.2
42                4.4               3.2                1.3               0.2
4.5


A practical way to explore the groups of a DataFrame is to use the
aggregation function agg(). For instance, the mean value of the sepal width
of each group of sepal length can be computed by passing the np.mean()
function to agg() as follows:

>>> M = grouped_data['sepal width (cm)'].agg(np.mean) 

>>> print(' The mean sepal width in cm for each group of sepal 

length is:', M) 

The mean sepal width in cm for each group of sepal length is: sepal 

length (cm) 

4.3 3.000000 

4.4 3.033333 

4.5 2.300000 

4.6 3.325000 

4.7 3.200000 

4.8 3.180000 

4.9 2.950000 

5.0 3.120000 


5.1 3.477778 
5.2 3.425000 
5.3 3.700000 
5.4 3.550000 
5.5 2.842857 
5.6 2.816667 
5.7 3.100000 
5.8 2.885714 
5.9 3.066667 
6.0 2.733333 
6.1 2.850000 
6.2 2.825000 
6.3 2.855556 
6.4 2.957143 
6.5 3.000000 
6.6 2.950000 
6.7 3.050000 
6.8 3.000000 
6.9 3.125000 
7.0 3.200000 
7.1 3.000000 
7.2 3.266667 
7.3 2.900000 
7.4 2.800000 
7.6 3.000000 
7.7 3.050000 
7.9 3.800000 
Name: sepal width (cm), dtype: float64 


We can get multiple information or statistics using the aggregation function. 
For example, let’s compute the sum and mean of the sepal width for each 
group using the NumPy Statistic functions: 

>>> M = grouped_data['sepal width (cm)'].agg([np.mean,np.sum]) 

>>> print(' The basic statistics of sepal width in cm for each group 
are:', M) 

The basic statistics of sepal width in cm for each group are: 

mean sum 

sepal length (cm) 

4.3 3.000000 3.0 

4.4 3.033333 9.1 

4.5 2.300000 2.3 

4.6 3.325000 13.3 

4.7 3.200000 6.4 

4.8 3.180000 15.9 

4.9 2.950000 17.7 

5.0 3.120000 31.2 

5.1 3.477778 31.3 

5.2 3.425000 13.7 

5.3 3.700000 3.7 

5.4 3.550000 21.3 

5.5 2.842857 19.9 

5.6 2.816667 16.9 

5.7 3.100000 24.8 

5.8 2.885714 20.2 

5.9 3.066667 9.2 

6.0 2.733333 16.4 


6.1 2.850000 17.1 
6.2 2.825000 11.3 
6.3 2.855556 25.7 
6.4 2.957143 20.7 
6.5 3.000000 15.0 
6.6 2.950000 5.9 
6.7 3.050000 24.4 
6.8 3.000000 9.0 
6.9 3.125000 12.5 
7.0 3.200000 3.2 
7.1 3.000000 3.0 
7.2 3.266667 9.8 
7.3 2.900000 2.9 
7.4 2.800000 2.8 
7.6 3.000000 3.0 
7.7 3.050000 12.2 
7.9 3.800000 3.8 


You can easily get the size of each group using the np.size function. Let’s
see the size of the groups in the Iris dataset:
>>> print(' The size of each group of sepal length of Iris dataset is:\n',
grouped_data.agg(np.size))
The size of each group of sepal length of Iris dataset is:
                   sepal width (cm)  petal length (cm)  petal width (cm)
sepal length (cm)
4.3 1.0 1.0 1.0
4.4 3.0 3.0 3.0
4.5 1.0 1.0 1.0
4.6 4.0 4.0 4.0


4.7 2.0 2.0 2.0 
4.8 5.0 5.0 5.0 
4.9 6.0 6.0 6.0 
5.0 10.0 10.0 10.0 
5.1 9.0 9.0 9.0 
5.2 4.0 4.0 4.0 
5.3 1.0 1.0 1.0 
5.4 6.0 6.0 6.0 
5.5 7.0 7.0 7.0 
5.6 6.0 6.0 6.0 
5.7 8.0 8.0 8.0 
5.8 7.0 7.0 7.0 
5.9 3.0 3.0 3.0 
6.0 6.0 6.0 6.0 
6.1 6.0 6.0 6.0 
6.2 4.0 4.0 4.0 
6.3 9.0 9.0 9.0 
6.4 7.0 7.0 7.0 
6.5 5.0 5.0 5.0 
6.6 2.0 2.0 2.0 
6.7 8.0 8.0 8.0 
6.8 3.0 3.0 3.0 
6.9 4.0 4.0 4.0 
7.0 1.0 1.0 1.0 
7.1 1.0 1.0 1.0 
7.2 3.0 3.0 3.0 
7.3 1.0 1.0 1.0 
7.4 1.0 1.0 1.0 


7.6 1.0 1.0 1.0 
7.7 4.0 4.0 4.0 
7.9 1.0 1.0 1.0 
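
The examples above pass NumPy functions to agg(). Pandas also accepts the
names of common aggregations as strings, which is sketched below on a small
hypothetical DataFrame. The column names species and width, and the values,
are assumptions made for this illustration and are not the Iris data.

```python
import pandas as pd

# Hypothetical data with two groups, 'a' and 'b'
df = pd.DataFrame({'species': ['a', 'a', 'b', 'b', 'b'],
                   'width':   [3.0, 3.4, 2.8, 3.0, 3.2]})

# Group by species and keep only the width column
grouped = df.groupby('species')['width']

# Several aggregations at once, given by name:
# one row per group, one column per aggregation
stats = grouped.agg(['mean', 'sum', 'count'])
print(stats)

# A single group can still be retrieved with get_group()
group_b = grouped.get_group('b')
print(group_b)
```

The string names avoid importing NumPy only for aggregation, and the result
has the same layout as the mean and sum table shown earlier.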


How to save and load data from a text file? 
The Pandas library offers functions that allow saving data to and reading
data from text files. For example, the Iris data frame can be saved in a
csv file with the function DataFrame.to_csv(). The input of this function
is the name of the file in which the data frame is to be saved:

>>> Iris_data.to_csv('filename.csv')
In order to load or read data saved in a csv file, the function
pd.read_csv() is very useful. Let’s say for example that the dataset below
is saved in a file called mydata.csv:


Example data

Name     Gender   Age   Smoker
Alec     male     29    no
James    male     30    no
Mark     male     40    yes
Silvia   female   35    no
Helene   female   30    no


In order to load this data, we type:

>>> data = pd.read_csv('mydata.csv')
>>> print(' The type of data loaded is:', type(data))

The type of data loaded is: <class 'pandas.core.frame.DataFrame'>
>>> print(' The loaded data is:\n', data)

The loaded data is:
     Name  Gender  Age Smoker
0    Alec    male   29     no
1   James    male   30     no
2    Mark    male   40    yes
3  Silvia  female   35     no
4  Helene  female   30     no
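
To tie saving and loading together, here is a minimal round-trip sketch.
It writes a small DataFrame to a csv file in a temporary folder and reads
it back. The use of the tempfile module and the option index=False (which
keeps the row labels out of the file) are choices made for this
illustration and are not part of the example above.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'Name': ['Alec', 'James', 'Mark'],
                   'Gender': ['male', 'male', 'male'],
                   'Age': [29, 30, 40]})

# Write to a temporary folder so no existing file is overwritten
path = os.path.join(tempfile.mkdtemp(), 'mydata.csv')

# index=False: do not write the row labels 0, 1, 2 as an extra column
df.to_csv(path, index=False)

# Read the file back into a new DataFrame
loaded = pd.read_csv(path)
print(loaded)
```

Without index=False, the row labels are written as an unnamed first column,
which then shows up as an extra column when the file is read back.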


To summarize, in this chapter we have learnt how to create Series and Data
Frames with Pandas. We also learnt how to detect, delete and replace
missing values, as well as how to compute basic statistics and how to
filter, sort and group data. In the next chapter we are going to learn how
we can visualize and plot figures of data using the Pandas library or the
matplotlib library.


Chapter 7: Visualisation and results 


In the previous chapters we learnt how to handle data. In this chapter we
are going to learn methods of visualizing data as well as creating figures
to present analyses of data. In order to develop figures, many libraries
are available in Python. This section presents only functionalities of the
matplotlib library, which is an advanced library in Python to develop
figures.


7.1. Matplotlib library in Python 
Matplotlib is an open source, advanced package available in Python for data
visualization. Data visualization is crucial in data analysis as well as for
communicating results to stakeholders. This library is built on the NumPy
library. One widely used module of matplotlib is pyplot. This module has an
interface similar to that of Matlab, a programming tool that is efficient for
numerical programming. If you have not installed matplotlib yet, you can do
so by typing the following command in the command prompt (terminal):

pip install matplotlib 

If you have installed Anaconda and you are using Jupyter, this library
should already be installed by default. All you have to do is import the
package.
Before diving into examples of how to use the matplotlib library, let's see
the components of a figure that we can set. A Figure is the entire figure,
which is formed by one or more Axes. An Axes is what is commonly called a
plot. Each Axes contains two Axis objects, or three in the case of a 3D plot.
The Axis objects are responsible for setting the limits of a plot. An Artist
is any component that can appear in a figure, such as a text object or a
collection object.
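
To make these components concrete, here is a minimal sketch that creates one
figure with a single plot and then looks at each component in turn. It uses
the function plt.subplots(), which is not introduced in the text, and selects
a backend that does not open a window; both are assumptions made so that the
example runs on its own.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt
from matplotlib.artist import Artist

# One Figure holding a single Axes (a single plot).
fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4], [1, 4, 9, 16])

# The Axis objects of the Axes carry the limits of the plot.
ax.set_xlim(0, 5)
ax.set_ylim(0, 20)

# Text added to the Axes is one example of an Artist.
note = ax.text(1, 15, "a text Artist")

print(type(fig).__name__)       # class of the whole figure
print(len(fig.axes))            # number of plots in the figure
print(type(ax.xaxis).__name__)  # class of the x-axis object
print(ax.get_xlim())            # limits set on the x-axis
print(isinstance(note, Artist))
```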


7.2 Basic plot in matplotlib 
We will start with the pyplot module of matplotlib. This module offers the
basic functions for adding components to the current axes of a figure. To use
this module, it should be imported as follows:
>>> import numpy as np
>>> import matplotlib.pyplot as plt
Note that we imported the NumPy library as well because we will be working
with NumPy arrays.
Now we can create a single plot of data using the function plot(). Let's
create a series of data and plot it.
>>> X = np.array([1,2,3,4])
>>> Y = X ** 2
>>> plt.plot(X, Y)
>>> plt.show()







In this example, we created an array of values and computed the square of
each value. The plot function is supplied with 2 inputs, where the first
argument holds the values for the x-axis and the second argument holds the
corresponding values for the y-axis. The plot would be easier to understand
if the axes had labels and the plot had a title. To add these elements to our
plot, we can use the xlabel() function, which adds a label to the x-axis, and
the ylabel() function, which adds a label to the y-axis. The title() function
adds a title to the plot.
>>> A = np.array([1,2,3,4])
>>> B = A ** 2
>>> plt.plot(A, B)
>>> plt.xlabel('A labels')
>>> plt.ylabel('B = A**2')
>>> plt.title('My first plot in Python')
>>> plt.show()
Figure: My first plot in Python

Now we can change the size of the figure using the figure() function and
passing the argument figsize, which specifies the width and height of the
figure in inches. For example, let's change the size of the previous figure
we created.

>>> A = np.array([1,2,3,4])
>>> B = A ** 2
>>> plt.figure(figsize=(5,5))
>>> plt.plot(A, B)
>>> plt.xlabel('A labels')
>>> plt.ylabel('B = A**2')
>>> plt.title('My first plot in Python with different size')
>>> plt.show()
Figure: My first plot in Python with different size

The plot function can take other input arguments. In fact, we can plot two
different datasets in the same plot. As an example, let's compute the values
of A ** 3 and plot them in the same figure.

>>> A = np.array([1,2,3,4])
>>> B = A ** 2
>>> B2 = A ** 3
>>> plt.figure(figsize=(10,5))
>>> plt.plot(A, B, A, B2)
>>> plt.xlabel('A labels')
>>> plt.ylabel('B = A**2')
>>> plt.title('My first plot in Python with two datasets')
>>> plt.show()
Figure: My first plot in Python with two datasets

Note that by default the plot function used a different color to plot the
second dataset. Also, by default the plot function draws the data as a line.
In fact, we can pass another argument to the plot function that specifies
how the data is plotted. In other words, we specify whether the data is
plotted as a line or with a marker such as '+', '*' or 'o'. We can also
specify the color. For instance, 'go' makes the plot function use the marker
o and draw the data in green. We can also specify the line width if the data
is plotted as a line. For example:

>>> A = np.array([1,2,3,4])
>>> B = A ** 2
>>> B2 = A ** 3
>>> plt.figure(figsize=(10,5))
>>> plt.plot(A, B, A, B2, linewidth=5)
>>> plt.xlabel('A labels')
>>> plt.ylabel('B = A**2')
>>> plt.title('My first plot in Python with two datasets and line width=5')
>>> plt.show()
Figure: My first plot in Python with two datasets and line width=5

The following example uses different markers to plot two datasets:
>>> X = np.array([1,2,3,4])
>>> Y = X ** 2
>>> Y2 = X ** 3
>>> plt.figure(figsize=(5,5))
>>> plt.plot(X, Y, 'r*', X, Y2, 'ko')
>>> plt.xlabel('X labels')
>>> plt.ylabel('Y = X**2')
>>> plt.title('My first plot in Python with two datasets and different markers')
>>> plt.show()
Figure: My first plot in Python with two datasets and different markers
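
To make the format strings more concrete, the sketch below draws three
datasets with three different format strings and then reads back the color,
marker and line style that matplotlib stored for each line. The variable
names, the third dataset and the choice of 'go', 'r--' and 'k*' are
illustrative only, and a backend that does not open a window is selected so
that the example runs on its own.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np

x = np.array([1, 2, 3, 4])

fig, ax = plt.subplots(figsize=(5, 5))

# Each format string combines a color letter with a marker or a line style.
(green_dots,) = ax.plot(x, x ** 2, "go", label="x**2")                # green circles
(red_dashes,) = ax.plot(x, x ** 3, "r--", linewidth=2, label="x**3")  # red dashed line
(black_stars,) = ax.plot(x, 10 * x, "k*", label="10*x")               # black stars

# The labels given above are shown in the legend.
ax.legend(loc="best")

for line in (green_dots, red_dashes, black_stars):
    print(line.get_label(), line.get_color(), line.get_marker(), line.get_linestyle())
```

The linewidth argument only has a visible effect on the dataset drawn as a
line, which is why it is passed with the dashed line here.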








7.3 Multiple plots in same figure 

We can draw several plots in the same figure using the subplot() function.
The datasets that we plotted in the same plot in the previous section can
instead be plotted in different plots of the same figure. The subplot()
function takes the following input arguments: nrows, ncols and index. The
argument nrows indicates the number of rows in the figure, ncols the number
of columns in the figure, and index selects the plot to draw in, counting
from 1. For example, we can plot our two datasets in a figure with two rows
as follows:

>>> X = np.array([1,2,3,4])
>>> Y = X ** 2
>>> Y2 = X ** 3
>>> plt.figure(figsize=(10,10))
>>> plt.subplot(2,1,1)
>>> plt.plot(X, Y, linewidth=5)
>>> plt.xlabel('X labels')
>>> plt.ylabel('Y = X**2')
>>> plt.title('My first subplot in Python')
>>> plt.subplot(2,1,2)
>>> plt.plot(X, Y2, linewidth=5)
>>> plt.xlabel('X labels')
>>> plt.ylabel('Y = X**3')
>>> plt.title('My second subplot in Python')
>>> plt.show()
Figure: My first subplot in Python

We can also plot the two datasets in a figure with one row and two columns
by passing the arguments (1,2,1) and (1,2,2) to the subplot() function as
follows:

>>> X = np.array([1,2,3,4])
>>> Y = X ** 2
>>> Y2 = X ** 3
>>> plt.figure(figsize=(10,10))
>>> plt.subplot(1,2,1)
>>> plt.plot(X, Y, linewidth=5)
>>> plt.xlabel('X labels')
>>> plt.ylabel('Y = X**2')
>>> plt.title('My first subplot in Python')
>>> plt.subplot(1,2,2)
>>> plt.plot(X, Y2, linewidth=5)
>>> plt.xlabel('X labels')
>>> plt.ylabel('Y = X**3')
>>> plt.title('My second subplot in Python')
>>> plt.show()
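
A related way to build the same layouts, sketched here as an alternative
rather than as the method used in this chapter, is the function
plt.subplots(), which creates the figure and returns all of its plots at
once. Its first two arguments are again the number of rows followed by the
number of columns. The figure size and the call to tight_layout() are choices
made for this illustration, as is the backend that does not open a window.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np

X = np.array([1, 2, 3, 4])
Y = X ** 2
Y2 = X ** 3

# 1 row, 2 columns: axes is an array holding the two plots.
fig, axes = plt.subplots(1, 2, figsize=(10, 5))

axes[0].plot(X, Y, linewidth=5)
axes[0].set_xlabel("X labels")
axes[0].set_ylabel("Y = X**2")
axes[0].set_title("My first subplot in Python")

axes[1].plot(X, Y2, linewidth=5)
axes[1].set_xlabel("X labels")
axes[1].set_ylabel("Y = X**3")
axes[1].set_title("My second subplot in Python")

# Keeps the labels of neighbouring plots from overlapping.
fig.tight_layout()

print(axes.shape)
```

With this approach each plot is addressed through its own object, so there is
no need to remember which plot is currently active.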
7.4 Types of plots
Matplotlib offers several functions to create different graphs that are
useful in data science and statistical analysis. Bar graphs are a handy way
to assess and compare different groups in the data and explore their
distribution. The bar() function takes as input arguments a set of
categorical data and their associated values. It also optionally takes a
color if you want to make a bar graph where each category is represented
with a specific color. For example, let's take the Iris data from the example
in the previous chapter about the DataFrame data structure. Remember that the
Iris data is formed by a sample of 3 species of the Iris flower. Each flower
is described by the length and width of its sepals and petals. This dataset
is available in the sklearn library, from which we are going to import it.
Because we will be using the DataFrame structure, we will import the Pandas
library as well and create a DataFrame for the Iris dataset.

>>> import pandas as pd
>>> import numpy as np
>>> from sklearn import datasets
>>> Iris = datasets.load_iris()
>>> Iris_d = Iris.data
>>> Iris_DF = pd.DataFrame(Iris_d, columns=Iris.feature_names)


The Iris dataset also has a variable associated with each set of sepal and
petal measurements. This variable indicates the species of the Iris flower
and is stored in the attribute target. In the following command we create a
variable for this target:

>>> Y = Iris.target 


Now that we have our data ready, we are going to plot a bar graph of the
sepal length as follows:
>>> plt.bar(Y, Iris_DF['sepal length (cm)'])
>>> plt.title('Bar Graph of the Sepal length')
>>> plt.xlabel('Iris Species')
>>> plt.ylabel('Sepal length (cm)')
>>> plt.show()
Figure: Bar Graph of the Sepal length

Because Y holds one species code (0, 1 or 2) for each flower, the bars of all
the flowers of a species are drawn on top of each other, and the visible
height is the largest sepal length of that species.
By default, matplotlib draws the bars in blue. We can change the color of the
bars in the graph by passing the argument color as follows:
>>> plt.bar(Y, Iris_DF['sepal length (cm)'], color='black')
>>> plt.title('Bar Graph of the Sepal length')
>>> plt.xlabel('Iris Species')
>>> plt.ylabel('Sepal length (cm)')
>>> plt.show()
Figure: Bar Graph of the Sepal length
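
When one bar per category is wanted, the values can first be summarized per
category and then passed to bar() together with the category names. The
following is a minimal sketch on a tiny made-up table: the column names
species and length and all the numbers are placeholders chosen for
illustration, not Iris measurements, and a backend that does not open a
window is selected so that the example runs on its own.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up measurements: two flowers for each of three hypothetical groups.
flowers = pd.DataFrame({
    "species": ["a", "a", "b", "b", "c", "c"],
    "length": [5.0, 5.2, 6.0, 5.8, 6.5, 6.7],
})

# One summary value (the mean) per category.
mean_length = flowers.groupby("species")["length"].mean()

fig, ax = plt.subplots()
# Category names on the x-axis, one bar per category.
bars = ax.bar(list(mean_length.index), mean_length.values, color="grey")
ax.set_title("Mean length per species")
ax.set_xlabel("Species")
ax.set_ylabel("Mean length")

print(mean_length)
```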








We can also change the orientation of the bars from vertical to horizontal
using the function barh(). The barh() function takes the same input
arguments as the bar() function. Let's plot a horizontal bar graph of the
sepal length for each Iris species. With horizontal bars the species are on
the y-axis, so the two axis labels are swapped.

>>> plt.barh(Y, Iris_DF['sepal length (cm)'], color='black')
>>> plt.title('Bar Graph of the Sepal length')
>>> plt.xlabel('Sepal length (cm)')
>>> plt.ylabel('Iris Species')
>>> plt.show()
Figure: Bar Graph of the Sepal length (horizontal bars)

We can also supply the barh() or the bar() function with an extra argument,
xerr (for barh()) or yerr (for bar()), and its values, in order to draw error
bars. For example, we may want to plot the variance of the variable for
which the bar graph is plotted. In the case of the sepal length, if we are
using the barh() function, we can do:

>>> # Computing the variance with the NumPy library
>>> V = np.var(Iris_DF['sepal length (cm)'])
>>> plt.barh(Y, Iris_DF['sepal length (cm)'], xerr=V, color='grey')
>>> plt.title('Bar Graph of the Sepal length with Variance')
>>> plt.xlabel('Sepal length (cm)')
>>> plt.ylabel('Iris Species')
>>> plt.show()
Figure: Bar Graph of the Sepal length with Variance

If the function bar() is used for vertical bars, we pass the argument yerr to
plot the variance with the bars as follows:

>>> plt.bar(Y, Iris_DF['sepal length (cm)'], yerr=V, color='grey')
>>> plt.title('Bar Graph of the Sepal length with Variance')
>>> plt.xlabel('Iris Species')
>>> plt.ylabel('Sepal length (cm)')
>>> plt.show()
Figure: Bar Graph of the Sepal length with Variance
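
Error bars are often drawn from a spread measured separately for each group,
such as the standard deviation, rather than from a single value shared by all
the bars. The sketch below illustrates this on randomly generated numbers.
The three groups, their means, the seed and the capsize setting are
assumptions made only so that the example runs on its own, and a backend that
does not open a window is selected.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the made-up data is repeatable

# Three groups of 50 made-up measurements each.
groups = [rng.normal(loc=m, scale=0.5, size=50) for m in (5.0, 6.0, 6.5)]

means = np.array([g.mean() for g in groups])
stds = np.array([g.std() for g in groups])  # one error value per bar

positions = np.arange(3)

fig, ax = plt.subplots()
# yerr accepts one value per bar; capsize adds small caps to the error bars.
bars = ax.bar(positions, means, yerr=stds, color="grey", capsize=5)
ax.set_xticks(positions)
ax.set_title("Group means with standard deviation")
ax.set_xlabel("Group")
ax.set_ylabel("Mean value")

print(means.round(3), stds.round(3))
```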
Sometimes, we have a dataset with multiple variables and we want to create
a single bar graph that shows the bars of different variables for each
category, as in the case of the Iris dataset we are using in this chapter. To
plot multiple bars in the same graph, we need to call the bar() function as
many times as the number of variables for which the graph is plotted. In
this case, we need to specify the position and width of the bars in order to
place them side by side. Let's see how we can use this to plot the sepal
length and sepal width bars in the same graph. First, we need to group the
Iris flowers according to the species and compute the mean sepal length and
mean sepal width for each species. Remember we can do that using the
groupby() function like we did in the previous chapter. We also need to add
the Y variable, which indicates the Iris species, to our DataFrame.
>>> data = Iris_DF.copy()
>>> data['Y'] = Y # Adding the Y variable to the DataFrame
>>> grouped_data = data.groupby('Y') # Grouping the Iris flowers according to the species
>>> # Computing the mean sepal length and width for each species
>>> M = grouped_data['sepal length (cm)'].mean()
>>> M2 = grouped_data['sepal width (cm)'].mean()
>>> print('The mean sepal length (cm) for each species is:', M)
The mean sepal length (cm) for each species is: Y
0    5.006
1    5.936
2    6.588
Name: sepal length (cm), dtype: float64
>>> print('The mean sepal width (cm) for each species is:', M2)
The mean sepal width (cm) for each species is: Y
0    3.428
1    2.770
2    2.974
Name: sepal width (cm), dtype: float64

Now that the data is ready, we plot the grouped bar graph as follows:
>>> ind = np.arange(3)
>>> width = 0.3
>>> plt.bar(ind, M, width, color='grey')
>>> plt.bar(ind + width, M2, width, color='blue')
>>> plt.title('Bar Graph of the Sepal length and width (cm)')
>>> plt.xlabel('Iris Species')
>>> plt.ylabel('Mean value (cm)')
>>> plt.show()
Figure: Bar Graph of the Sepal length and width (cm)

We can add a legend to our graph using the legend() function in order to
distinguish between the two sets of bars. We can also specify the position
of the ticks on an axis. The following statements show how we can do that:
>>> ind = np.arange(3)
>>> width = 0.3
>>> plt.bar(ind, M, width, color='grey', label='Sepal length (cm)')
>>> plt.bar(ind + width, M2, width, color='blue', label='Sepal width (cm)')
>>> plt.title('Bar Graph of the Sepal length and width (cm)')
>>> plt.xlabel('Iris Species')
>>> plt.ylabel('Mean value (cm)')
>>> plt.xticks(ind + width/2, ind) # Position of the xticks
>>> plt.legend(loc='best') # Position of the legend
>>> plt.show()
Figure: Bar Graph of the Sepal length and width (cm)

We can also stack the bars vertically. In this case we pass the argument
bottom to the bar() function for the second variable and set it to the
values of the bars below. For example, to stack the sepal length and width
vertically, we use the code below. Both sets of bars are now centered on
ind, so the ticks are placed at ind as well:
>>> ind = np.arange(3)
>>> width = 0.3
>>> plt.bar(ind, M, width, color='grey', label='Sepal length (cm)')
>>> plt.bar(ind, M2, width, color='blue', label='Sepal width (cm)', bottom=M)
>>> plt.title('Bar Graph of the Sepal length and width (cm)')
>>> plt.xlabel('Iris Species')
>>> plt.ylabel('Mean value (cm)')
>>> plt.xticks(ind, ind) # Position of the xticks
>>> plt.legend(loc='best') # Position of the legend
>>> plt.show()
Figure: Bar Graph of the Sepal length and width (cm)
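
As a shorter route to the same kind of graph, a DataFrame can draw grouped or
stacked bars itself through its plot.bar() method, which uses matplotlib
underneath. The sketch below is an alternative to the calls above, not a
replacement for them. The table holds species means rounded to three
decimals; the sepal width values are what one would expect from the sklearn
copy of the data and should be treated as an assumption. A backend that does
not open a window is selected so that the example runs on its own.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt
import pandas as pd

# Species means (cm); the width column is an assumed, rounded set of values.
means = pd.DataFrame(
    {
        "Sepal length (cm)": [5.006, 5.936, 6.588],
        "Sepal width (cm)": [3.428, 2.770, 2.974],
    },
    index=[0, 1, 2],  # species codes, as used for Y
)

fig, (ax_grouped, ax_stacked) = plt.subplots(1, 2, figsize=(10, 5))

# One bar per column, side by side for each species.
means.plot.bar(ax=ax_grouped, color=["grey", "blue"])
ax_grouped.set_title("Grouped bars")

# stacked=True puts the second column on top of the first.
means.plot.bar(ax=ax_stacked, color=["grey", "blue"], stacked=True)
ax_stacked.set_title("Stacked bars")

for ax in (ax_grouped, ax_stacked):
    ax.set_xlabel("Iris Species")
    ax.set_ylabel("Mean value (cm)")

# The total height of each stacked bar is the sum of the two means.
print(means.sum(axis=1))
```

The column names of the DataFrame are used for the legend, so no label
arguments are needed in this version.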
Histograms are another common graph in data analysis and statistical
analysis that show the distribution of a variable. A histogram is a plot that
shows the frequency of the values that a variable can take. In other words,
we plot ranges of values of a variable against their frequency, which
describes the distribution of the variable. The hist() function plots
histograms with the matplotlib library. For example, let's plot the
histogram of the sepal length of the Iris data:

>>> plt.title('Histogram of the Iris sepal length')
>>> plt.xlabel('Sepal length (cm)')
>>> plt.ylabel('Frequency')
>>> plt.hist(Iris_DF['sepal length (cm)'])
>>> plt.show()
Figure: Histogram of the Iris sepal length

As with the other plot functions, the color of the histogram can be changed
by passing a color input argument as in the following example:

>>> plt.title('Histogram of the Iris sepal length')
>>> plt.xlabel('Sepal length (cm)')
>>> plt.ylabel('Frequency')
>>> plt.hist(Iris_DF['sepal length (cm)'], color='grey')
>>> plt.show()
Figure: Histogram of the Iris sepal length
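
The ranges of values are called bins, and hist() uses 10 of them unless told
otherwise. The following sketch, run on randomly generated numbers rather
than on the Iris data, shows how the optional bins argument changes the
number of ranges and how the counts returned by hist() can be inspected. The
sample, its size and the seed are assumptions for illustration, and a backend
that does not open a window is selected so that the example runs on its own.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
values = rng.normal(loc=5.8, scale=0.8, size=150)  # made-up sample

fig, (ax_default, ax_fine) = plt.subplots(1, 2, figsize=(10, 4))

# hist() returns the counts, the bin edges and the drawn bars.
counts, edges, _ = ax_default.hist(values, color="grey")
ax_default.set_title("Default bins")

counts_fine, edges_fine, _ = ax_fine.hist(values, bins=20, color="grey")
ax_fine.set_title("bins=20")

for ax in (ax_default, ax_fine):
    ax.set_xlabel("Value")
    ax.set_ylabel("Frequency")

print(len(counts), len(counts_fine))  # number of bins in each histogram
print(int(counts.sum()))              # every value falls in exactly one bin
```

More bins show finer detail of the distribution but leave fewer values in
each bin, so the choice is a trade-off that depends on the size of the data.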
In order to visually detect correlation between variables, we can use
scatter plots, which plot variables against each other in a 2-dimensional
space. To draw a scatter plot, we use the function scatter(). For example,
we plot the sepal length against the sepal width:

>>> plt.scatter(Iris_DF['sepal length (cm)'], Iris_DF['sepal width (cm)'])
>>> plt.title('Scatter plot of sepal length and sepal width')
>>> plt.xlabel('Sepal length (cm)')
>>> plt.ylabel('Sepal width (cm)')
>>> plt.show()
Figure: Scatter plot of sepal length and sepal width
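
A scatter plot is often read together with the correlation coefficient, and
the points can be colored by group through the c argument of scatter(). Here
is a small sketch on made-up numbers. The variables x, y and group, and the
way y is built from x, are assumptions chosen so that a positive correlation
is visible; a backend that does not open a window is selected so that the
example runs on its own.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)

x = rng.uniform(4.0, 8.0, size=90)
y = 0.5 * x + rng.normal(scale=0.3, size=90)  # y rises with x, plus noise
group = np.repeat([0, 1, 2], 30)              # made-up group code per point

# Pearson correlation coefficient between x and y.
r = np.corrcoef(x, y)[0, 1]

fig, ax = plt.subplots()
# c colors each point according to its group code.
points = ax.scatter(x, y, c=group)
ax.set_title(f"Scatter plot (r = {r:.2f})")
ax.set_xlabel("x")
ax.set_ylabel("y")

print(round(float(r), 2))
```

A coefficient close to 1 or -1 goes with points that lie near a straight
line, while a coefficient close to 0 goes with a cloud that shows no linear
trend.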
The scatter plot in the figure above is in a 2-dimensional space. We can also
visualize the same scatter plot in a 3-dimensional space using the
scatter3D() function. This function is part of the mplot3d toolkit for
3-dimensional plots. So, we first import the module and then draw the
scatter plot in a 3-dimensional plot. Only two variables are passed here, so
all the points are drawn at the height z = 0:
>>> from mpl_toolkits import mplot3d
>>> ax = plt.axes(projection='3d')
>>> ax.scatter3D(Iris_DF['sepal length (cm)'], Iris_DF['sepal width (cm)'])
>>> ax.set_xlabel('Sepal length (cm)')
>>> ax.set_ylabel('Sepal width (cm)')
>>> ax.set_title('3-D scatter plot of sepal length and sepal width')
>>> plt.show()
Figure: 3-D scatter plot of sepal length and sepal width
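
A 3-dimensional plot becomes more informative when a third variable is passed
as the third argument of scatter3D(), which then sets the height of each
point. The sketch below uses made-up numbers for all three variables; the
names x, y and z, the seed and the figure size are assumptions for
illustration, and a backend that does not open a window is selected so that
the example runs on its own.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits import mplot3d  # registers the '3d' projection on older versions

rng = np.random.default_rng(3)
x = rng.uniform(4.0, 8.0, size=50)  # made-up first variable
y = rng.uniform(2.0, 4.5, size=50)  # made-up second variable
z = x + y                           # made-up third variable

fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot(projection="3d")

# The third argument sets the height of each point.
ax.scatter3D(x, y, z, color="grey")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("z")
ax.set_title("3-D scatter plot of three variables")

print(ax.name)
```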
Conclusion


Thank you for making it through to the end of Python for Data analysis:
The crash Course for Beginners to Learn the Basics of Data Analysis with
Python Database Management and Programming with Pandas, NumPy and
Ipython. Let's hope it was informative and able to provide you with all of
the tools you need to achieve your goals, whatever they may be.


This book presented an overview of Python programming and the usefulness of
Python for developing applications. The book also presented the basic
programming syntax of Python for absolute beginners with no background in
programming. In chapter 3 of this book, the basic data structures are
presented, as well as the operations that are available to manipulate these
data structures. Chapters 5 to 7 provide intensive courses on how to use the
most fundamental libraries to master in data analysis. We presented the
functionalities of the NumPy, Pandas and matplotlib libraries. These
libraries provide efficient tools to handle, process and visualize large
datasets.


After finishing this book, you will have developed skills in writing modules
and functions in Python, and in loading and importing modules in Python. You
will also have developed skills in loading datasets into and exporting them
from Python environments. You will also have acquired skills in analyzing
and processing datasets using both the NumPy and Pandas libraries by
handling missing data and exploring datasets. Finally, you will have
developed skills in visualizing data using different types of graphs by
mastering the functionalities of the matplotlib library.


Overall, this book provides a guide on how to use these handy libraries in
data analysis. Once you have acquired these skills and know the
functionalities of the NumPy, Pandas and Matplotlib libraries, you will be
able to analyze any data you have in hand using Python. You will also be
able to develop more advanced skills to handle complex datasets.


Finally, if you found this book useful in any way, a review on Amazon is
always appreciated!


